Mapping the Unseen
  • Hands-on Exercises
    • Hands-on Exercise 1.1
    • Hands-on Exercise 1.2
    • Hands-on Exercise 2.1
    • Hands-on Exercise 2.2
    • Hands-on Exercise 3.1
    • Hands-on Exercise 4.1
    • Hands-on Exercise 5.1
    • Hands-on Exercise 5.2
    • Hands-on Exercise 6.1
    • Hands-on Exercise 7.1
    • Hands-on Exercise 8.1
    • Hands-on Exercise 9.1
    • Hands-on Exercise 10.1
    • Hands-on Exercise 10.2
  • In-class Exercises
    • In-class Exercise 1
    • In-class Exercise 2
    • In-class Exercise 3
    • In-class Exercise 4
    • In-class Exercise 5
    • In-class Exercise 6
    • In-class Exercise 7
    • In-class Exercise 8
    • In-class Exercise 9
    • In-class Exercise 10
  • Take-home exercise
    • Take-home Exercise 1
    • Take-home Exercise 2
    • Take-home Exercise 3
  • Home
  • About

On this page

  • 1 Overview
    • 1.1 Objective
    • 1.2 Context
    • 1.3 Methodology Overview
  • 2 Data Upload and Initial Setup
    • 2.1 Installing and launching the R packages
  • 3 Data Import and Preparation
    • 3.1 Primary Dataset
    • 3.2 Secondary Data — Locational Data
    • 3.3 Spatial Autocorrelation Check
  • 4 Exploratory Data Analysis (EDA)
    • 4.1 Descriptive Statistics
    • 4.2 Visualize Relationships
    • 4.3 Spatial Distribution Analysis
    • 4.4 Correlation Analysis
  • 5 Building a Hedonic Pricing Model by using Multiple Linear Regression Method
    • 5.1 Multiple Linear Regression Method
    • 5.2 Preparing Publication Quality Table: olsrr method
    • 5.3 Preparing Publication Quality Table: gtsummary method
    • 5.4 Key Insights:
  • 6 Building Random Forest Model
    • 6.1 Data Sampling
    • 6.2 Preparing coordinates data
    • 6.3 Calibrating Random Forest Model
    • 6.4 Key Insights:
  • 7 Calibrating gwr predictive method
    • 7.1 Data Sampling
    • 7.2 Converting the sf data.frame to SpatialPointDataFrame
    • 7.3 Computing adaptive bandwidth
    • 7.4 Constructing the adaptive bandwidth gwr model
    • 7.5 Retrieve gwr output object
    • 7.6 Converting the test data from sf data.frame to SpatialPointDataFrame
    • 7.7 Computing predicted values of the test data
    • 7.8 Key Takeaways

Take-home Exercise 3

Author

Tai Yu Ying

Published

October 21, 2024

Modified

November 12, 2024

1 Overview

1.1 Objective

The goal of this analysis is to predict the resale prices of Housing and Development Board (HDB) flats in Singapore for the period of July to September 2024. Using data from 2023, we aim to build a predictive model that considers both structural and locational factors, capturing the unique spatial characteristics that influence HDB resale prices. This model will provide valuable insights for:

  • Potential buyers looking to make informed investment decisions

  • Real estate investors seeking accurate market forecasts

  • Policymakers aiming to understand spatial trends in housing affordability

1.2 Context

Housing in Singapore, particularly HDB flats, represents a critical component of household wealth and serves as a significant investment for most residents. Given Singapore’s compact urban environment, several factors play a role in determining HDB resale prices:

  • Locational Factors: Proximity to amenities like public transportation (MRT), shopping centers, and quality schools

  • Structural Factors: Attributes such as the flat’s size, age, and floor level

  • Macro-level Influences: Economic conditions and government policies impacting the housing market

Accurately predicting resale prices is essential not only for financial planning and investment but also for urban planning and policy-making to ensure sustainable housing affordability.

1.3 Methodology Overview

Traditional models like Ordinary Least Squares (OLS) regression have limitations when applied to spatial data, as they often ignore spatial heterogeneity and autocorrelation. These limitations include:

  • Spatial Heterogeneity: Relationships between housing prices and influencing factors vary across locations.

  • Spatial Autocorrelation: Housing prices in nearby areas tend to be similar, leading to clustering effects.

To address these challenges, we will employ Geographically Weighted Models (GWMs), specifically:

  • Geographically Weighted Regression (GWR), which captures spatial variability in linear relationships.

By comparing the performance of OLS and GWR models, this study will demonstrate the effectiveness of spatially weighted approaches for real estate price prediction in Singapore.

2 Data Upload and Initial Setup

2.1 Installing and launching the R packages

In this exercise, the following R packages will be used, they are:

  • tidyverse: A collection of R packages (including dplyr, ggplot2, tidyr, and more) for data manipulation, visualization, and cleaning. It is essential for streamlined data handling and is widely used for data wrangling and efficient manipulation of data frames.
  • sf (Simple Features): A package that provides a standard approach for handling spatial data, such as shapefiles and geographic coordinates, in R. It’s useful for transforming data into spatial formats and performing spatial operations.
  • httr: Facilitates HTTP requests, enabling access to external APIs to fetch locational or additional data about amenities or other contextual factors that may influence housing prices.
  • jsonlite: A package used for parsing JSON data, often encountered in web APIs. This package is useful for converting JSON data into R data structures, allowing for seamless integration of JSON-formatted locational or contextual data.
  • rvest: Supports web scraping, making it easy to extract data from websites. This can be useful if additional information from web sources (such as lists of nearby amenities or environmental factors) is required for analysis.
  • tmap: A powerful package for creating static and interactive thematic maps. It’s helpful for visualizing spatial patterns, clusters, and trends in housing prices or other variables across geographic areas.
  • leaflet: A mapping package focused on interactive maps. It is useful for creating dynamic spatial visualizations, which can help communicate results effectively to stakeholders.
  • ggstatsplot: An extension of ggplot2 for enhanced statistical visualizations, adding statistical information and context to graphs. It’s useful for presenting both spatial and non-spatial relationships within the dataset.
  • spdep: Used for spatial dependency analysis, spdep provides tools for calculating spatial autocorrelation (e.g., Moran’s I) and creating spatial weights, essential for analyzing spatial relationships among housing prices or other spatial data points.
  • spgwr: Implements Geographically Weighted Regression (GWR) in R. This is useful for local regression analyses that reveal spatial variations in relationships, such as the effect of locational and structural factors on housing prices.
  • olsrr: A package for ordinary least squares (OLS) regression diagnostics, which can aid in assessing model assumptions, identifying influential observations, and evaluating model performance.
  • gtsummary: Provides summary tables and statistics in a clean format, making it easy to generate quick overviews of data or model outputs. Useful for generating reports with organized statistical summaries.
  • GWmodel: A specialized package for geographically weighted models, including Geographically Weighted Random Forests (GWRF), which are advanced models that capture complex spatial patterns in data.
  • rsample: A package for creating resampling objects, which is useful for cross-validation and other validation strategies to assess model performance on different subsets of data.
  • ranger: An efficient implementation of the Random Forest algorithm in R, which can handle large datasets and be applied in predictive modeling tasks, including spatial modeling when combined with GWmodel.
  • spatialML: Supports machine learning on spatial data, providing tools that are specifically designed to handle the unique characteristics of spatial data in predictive modeling.
pacman::p_load(tidyverse, sf, httr, jsonlite, rvest, tmap, leaflet, ggstatsplot, spdep, spgwr, olsrr, gtsummary, GWmodel, rsample, ranger, SpatialML)

3 Data Import and Preparation

3.1 Primary Dataset

HDB Resale Flat Prices: The primary dataset for this analysis is the HDB Resale Flat Prices dataset available from Data.gov.sg. This dataset provides information on resale transactions for HDB flats, including price, location, flat type, and structural details. Key fields in this dataset that will be used for the analysis include:

  • Flat Type: Differentiates between three-room, four-room, and five-room flats, which are the focus of the study.

  • Transaction Price: The resale price of each HDB flat, which is the dependent variable to be predicted.

  • Floor Area and Floor Level: Measures of the flat’s size and position, which are structural factors that impact value.

  • Remaining Lease: The number of years left on the lease, crucial for pricing as HDB flats depreciate over time.

resale <- read_csv("data/HDB/rawdata/resale.csv") %>%
    filter(
      month >= "2023-01" & month <= "2023-12",
      flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM")
    )
Rows: 192234 Columns: 11
  ── Column specification ────────────────────────────────────────────────────────
  Delimiter: ","
  chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
  dbl (3): floor_area_sqm, lease_commence_date, resale_price
  
  ℹ Use `spec()` to retrieve the full column specification for this data.
  ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The regional classification data for Singapore was sourced from Wikipedia, a widely recognized reference for administrative and geographical information. This dataset delineates Singapore into five primary regions: North, North-East, East, Central, and West, and assigns specific towns or planning areas to each region. Incorporating this classification enables a structured spatial analysis by providing a consistent regional framework. By joining this regional data with the primary dataset based on town names, the analysis can account for spatial heterogeneity and facilitate comparisons across regions, thus enhancing the rigor and depth of socio-economic or real estate trend analysis within Singapore.

region <- read_csv("data/HDB/rawdata/region.csv") 
Rows: 55 Columns: 2
  ── Column specification ────────────────────────────────────────────────────────
  Delimiter: ","
  chr (2): town, region
  
  ℹ Use `spec()` to retrieve the full column specification for this data.
  ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Define a helper function to capitalize the first letter of each word
  capitalize_words <- function(x) {
    sapply(strsplit(x, " "), function(words) {
      paste(toupper(substring(words, 1, 1)), tolower(substring(words, 2)), sep = "", collapse = " ")
    })
  }
  
  # Apply the helper function to standardize the `town` column to `region`
  region <- region %>%
    mutate(town = capitalize_words(town))
resale_tidy <- resale %>%
    
    # Create a new `address` column by combining `block` and `street_name`
    mutate(address = paste(block, street_name)) %>%
    
    # Extract the first two characters of `remaining_lease` as years and convert to integer
    mutate(remaining_lease_yr = as.integer(
      str_sub(remaining_lease, 0, 2))) %>%
    
    # Extract characters from position 9 to 11 of `remaining_lease` as months and convert to integer , default to 0 if missing
    mutate(remaining_lease_mth = if_else(is.na(as.integer(str_sub(remaining_lease, 9, 11))), 
                                         0, 
                                         as.integer(str_sub(remaining_lease, 9, 11)))) %>%
    
    # Apply the helper function to standardize the `town` column in `resale_tidy` dataset
    mutate(town = capitalize_words(town)) %>%
  
    # Perform the join
    left_join(region, by = "town") %>%
    
    # Manually assign "Central" region to specific towns after the join
    mutate(region = ifelse(town %in% c("Central Area", "Kallang/whampoa"), "Central", region)) %>%
    
    # Calculate total remaining lease in months by converting years to months and adding the extracted months
    mutate(
      remaining_lease_total_mths = (remaining_lease_yr * 12) + remaining_lease_mth
    ) %>%
    
    # Remove the intermediate columns `remaining_lease_yr` and `remaining_lease_mth`
    select(-remaining_lease_yr, -remaining_lease_mth) %>%
    
    # Split `storey_range` into minimum and maximum storey columns
    # Extract the first two characters of `storey_range` as the minimum storey level
    mutate(storey_min = as.integer(str_sub(storey_range, 1, 2))) %>%
    
    # Extract the last two characters of `storey_range` as the maximum storey level
    mutate(storey_max = as.integer(str_sub(storey_range, 7, 8))) %>%
    
    # Calculate the average storey level by averaging `storey_min` and `storey_max`
    mutate(
      storey_avg = (storey_min + storey_max) / 2
    )
write_rds(resale_tidy, "data/HDB/rds/resale_tidy.rds")
resale_tidy <- read_rds("data/HDB/rds/resale_tidy.rds")
add_list <- sort(unique(resale_tidy$address))
get_coords <- function(add_list){
    
    # Create a data frame to store all retrieved coordinates
    postal_coords <- data.frame()
      
    for (i in add_list){
      #print(i)
  
      r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
             query=list(searchVal=i,
                       returnGeom='Y',
                       getAddrDetails='Y'))
      data <- fromJSON(rawToChar(r$content))
      found <- data$found
      res <- data$results
      
      # Create a new data frame for each address
      new_row <- data.frame()
      
      # If single result, append 
      if (found == 1){
        postal <- res$POSTAL 
        lat <- res$LATITUDE
        lng <- res$LONGITUDE
        new_row <- data.frame(address= i, 
                              postal = postal, 
                              latitude = lat, 
                              longitude = lng)
      }
      
      # If multiple results, drop NIL and append top 1
      else if (found > 1){
        # Remove those with NIL as postal
        res_sub <- res[res$POSTAL != "NIL", ]
        
        # Set as NA first if no Postal
        if (nrow(res_sub) == 0) {
            new_row <- data.frame(address= i, 
                                  postal = NA, 
                                  latitude = NA, 
                                  longitude = NA)
        }
        
        else{
          top1 <- head(res_sub, n = 1)
          postal <- top1$POSTAL 
          lat <- top1$LATITUDE
          lng <- top1$LONGITUDE
          new_row <- data.frame(address= i, 
                                postal = postal, 
                                latitude = lat, 
                                longitude = lng)
        }
      }
  
      else {
        new_row <- data.frame(address= i, 
                              postal = NA, 
                              latitude = NA, 
                              longitude = NA)
      }
      
      # Add the row
      postal_coords <- rbind(postal_coords, new_row)
    }
    return(postal_coords)
  }
coords <- get_coords(add_list)
write_rds(coords, "data/HDB/rds/coords.rds")
coords <- read_rds("data/HDB/rds/coords.rds")
# Ensure that both address columns are in uppercase and have consistent formatting
  resale_tidy <- resale_tidy %>%
    mutate(address = toupper(address))
  
  coords <- coords %>%
    mutate(address = toupper(address))
  
  # Perform the join by address column
  resale_combined <- resale_tidy %>%
    left_join(coords, by = "address")
# Select and arrange columns in the specified order
  resale_final <- resale_combined %>%
    select(resale_price, town, region, flat_type, flat_model, floor_area_sqm, storey_avg, remaining_lease_total_mths, latitude, longitude)
  
  # Display the final data
  head(resale_final)
# A tibble: 6 × 10
    resale_price town       region  flat_type flat_model floor_area_sqm storey_avg
           <dbl> <chr>      <chr>   <chr>     <chr>               <dbl>      <dbl>
  1       380000 Ang Mo Kio North-… 3 ROOM    New Gener…             67          5
  2       635000 Ang Mo Kio North-… 3 ROOM    Model A                70         26
  3       380000 Ang Mo Kio North-… 3 ROOM    New Gener…             67          8
  4       365000 Ang Mo Kio North-… 3 ROOM    New Gener…             73          5
  5       418000 Ang Mo Kio North-… 3 ROOM    New Gener…             73          8
  6       380000 Ang Mo Kio North-… 3 ROOM    New Gener…             67          5
  # ℹ 3 more variables: remaining_lease_total_mths <dbl>, latitude <chr>,
  #   longitude <chr>
# Convert resale_final to an sf object if it has latitude and longitude columns
  resale_final <- resale_final %>%
    st_as_sf(coords = c("longitude", "latitude"), crs = 4326)  # Set the original CRS (e.g., WGS84)
  
  # Transform to the desired CRS (e.g., Singapore's SVY21 CRS)
  resale_final <- st_transform(resale_final, crs = 3414)

3.2 Secondary Data — Locational Data

To enhance the predictive model, we incorporate secondary data sources to capture locational factors that influence HDB resale prices. These factors typically require geographic data about the proximity to amenities, which can be collected from multiple sources:

  • Public Transportation (MRT Stations):

    • Data Source: LTA MRT Station Exit (GEOJSON) dataset from Data.gov.sg, provided by the Land Transport Authority (LTA).
    • Purpose: This dataset provides the geographical coordinates of each MRT station exit. Using this data will allow us to calculate the precise distance from HDB flats to the nearest MRT exit, giving a more accurate measure of accessibility than a central station location would.
    • Usage in Analysis: By incorporating MRT station exits instead of just station locations, we can improve the precision of our proximity calculations. Proximity to MRT stations is a critical factor influencing HDB resale prices, as flats closer to MRT exits are generally more attractive to buyers due to the ease of access to public transport.
    • By using MRT station exit data, we can calculate the shortest walking distance from each HDB flat to the nearest MRT exit, enhancing the locational data quality and potentially improving the predictive power of the model for HDB resale prices.
mrt <- st_read("data/Locational/rawdata/mrt.geojson")
  mrt <- st_transform(mrt, crs = 3414)
# Calculate nearest distance to MRT station
  nearest_mrt <- st_nearest_feature(resale_final, mrt)
  resale_final <- resale_final %>%
    mutate(proximity_to_mrt = st_distance(resale_final, mrt[nearest_mrt, ], by_element = TRUE))%>%
    mutate(proximity_to_mrt = as.numeric(proximity_to_mrt) / 1000)  # Convert meters to km
  • Good Primary Schools:

    • Data Source: Ministry of Education’s data on schools (from the CSV file provided) and additional information from Math Nuggets - Primary School Rankings 2024.

      Primary School Rankings 2024, extracted from: https://mathnuggets.sg/best-primary-schools-in-singapore/
      Ranking School
      1 Methodist Girls’ School (Primary)
      2 Tao Nan School
      3 Ai Tong School
      4 Holy Innocents’ Primary School
      5 CHIJ St. Nicholas Girls’ School (Primary)
      6 Admiralty Primary School
      7 St. Joseph’s Institution Junior
      8 Catholic High School (Primary)
      9 Anglo-Chinese School (Junior)
      10 Chongfu School
      11 Kong Hwa School
      12 St. Hilda’s Primary School
      13 Anglo-Chinese School (Primary)
      14 Nan Chiau Primary School
      15 Nan Hua Primary School
      16 Nanyang Primary School
      17 Pei Hwa Presbyterian Primary School
      18 Kuo Chuan Presbyterian Primary School
      19 Rulang Primary School
      20 Singapore Chinese Girls’ Primary School
    • Purpose: Proximity to top 20 primary schools often increases property values due to the demand for accessible quality education. This analysis will focus on the top primary schools as per the 2024 rankings, which will be scraped from the website. Only the schools ranked as the best in Singapore will be included in the dataset, filtered to match the school_name field in full capital letters for consistency.

# Load the CSV file with school information
  school_data <- read.csv("data/Locational/rawdata/Generalinformationofschools.csv")
  
  # Define a list of top 20 primary schools in uppercase
  top_schools <- c(
    "METHODIST GIRLS' SCHOOL (PRIMARY)", 
    "TAO NAN SCHOOL",
    "AI TONG SCHOOL",
    "HOLY INNOCENTS' PRIMARY SCHOOL",
    "CHIJ ST. NICHOLAS GIRLS' SCHOOL",
    "ADMIRALTY PRIMARY SCHOOL",
    "ST. JOSEPH'S INSTITUTION JUNIOR",
    "CATHOLIC HIGH SCHOOL",
    "ANGLO-CHINESE SCHOOL (JUNIOR)",
    "CHONGFU SCHOOL",
    "KONG HWA SCHOOL",
    "ST. HILDA'S PRIMARY SCHOOL",
    "ANGLO-CHINESE SCHOOL (PRIMARY)",
    "NAN CHIAU PRIMARY SCHOOL",
    "NAN HUA PRIMARY SCHOOL",
    "NANYANG PRIMARY SCHOOL",
    "PEI HWA PRESBYTERIAN PRIMARY SCHOOL",
    "KUO CHUAN PRESBYTERIAN PRIMARY SCHOOL",
    "RULANG PRIMARY SCHOOL",
    "SINGAPORE CHINESE GIRLS' PRIMARY SCHOOL"
  )
  
  # Filter the CSV data to keep only rows with top primary schools
  filtered_school_data <- school_data %>%
    filter(school_name %in% top_schools) %>%
    select(school_name, postal_code)
  
  # Prepare list of postal codes
  postal_codes <- filtered_school_data$postal_code
  
  # Use the get_coords function to retrieve coordinates
  school_coords <- get_coords(postal_codes)
  
  # Convert postal_code to character in filtered_school_data
  filtered_school_data <- filtered_school_data %>%
    mutate(postal_code = as.character(postal_code))
  
  # Merge the coordinates back with the filtered school data
  final_school_data <- filtered_school_data %>%
    left_join(school_coords, by = c("postal_code" = "postal"))
  
  # Display final dataset
  head(final_school_data)
  
  # Convert final_school_data to an sf object if it has latitude and longitude columns
  final_school_data <- final_school_data %>%
    st_as_sf(coords = c("longitude", "latitude"), crs = 4326)  # Set the original CRS (e.g., WGS84)
  
  # Transform to the desired CRS (e.g., Singapore's SVY21 CRS)
  final_school_data <- st_transform(final_school_data, crs = 3414)
# Calculate nearest distance to good school
  nearest_goodprisch <- st_nearest_feature(resale_final, final_school_data)
  resale_final <- resale_final %>%
    mutate(proximity_to_goodprisch = st_distance(resale_final, final_school_data[nearest_goodprisch, ], by_element = TRUE))%>%
    mutate(proximity_to_goodprisch = as.numeric(proximity_to_goodprisch) / 1000)  # Convert meters to km
#Create a buffer of 1km around each resale flat
  buffer_1km <- st_buffer(resale_final, 
                          dist = 1000)
  
  #Plot the newly created buffers and each good primary school
  tmap_mode("view")
  tm_shape(buffer_1km) +
    tm_polygons() +
  tm_shape(final_school_data) +
    tm_dots()
  
  #Count the number of good primary schools
  resale_final$within_1km_prisch <- lengths(
    st_intersects(buffer_1km, final_school_data))
  • Healthcare and Eldercare Facilities:

    • Data Source: dataset from Data.gov.sg, provided by the Ministry of Health (MOH).

    • Purpose: Accessibility to healthcare facilities can be a factor, especially for buyers looking for long-term residence or properties suitable for elderly family members.

eldercare <- st_read(dsn = "data/Locational/rawdata",
                       layer = "ELDERCARE") %>%
    st_transform(crs = 3414)
# Calculate nearest distance to eldercare
  nearest_eldercare <- st_nearest_feature(resale_final, eldercare)
  resale_final <- resale_final %>%
    mutate(proximity_to_eldercare = st_distance(resale_final, eldercare[nearest_eldercare, ], by_element = TRUE))%>%
    mutate(proximity_to_eldercare = as.numeric(proximity_to_eldercare) / 1000)  # Convert meters to km
CHAS <- st_read("data/Locational/rawdata/CHAS Clinics.kml") %>%
    st_transform(crs = 3414)
# Calculate nearest distance to CHAS clinic
  nearest_CHAS <- st_nearest_feature(resale_final, CHAS)
  resale_final <- resale_final %>%
    mutate(proximity_to_CHAS = st_distance(resale_final, CHAS[nearest_CHAS, ], by_element = TRUE))%>%
    mutate(proximity_to_CHAS = as.numeric(proximity_to_CHAS) / 1000)  # Convert meters to km
  • Supermarkets:

    • Data Source: dataset from Data.gov.sg, provided by the Singapore Food Agency (SFA).
    • Purpose: Supermarkets play a crucial role in daily life by providing access to essential groceries and household items. Proximity to a supermarket is highly desirable for residents, especially for families. As such, flats located near supermarkets are generally more attractive, as they offer added convenience for everyday needs, positively impacting resale prices.
spmrkt <- st_read("data/Locational/rawdata/Supermarket.geojson")
  summary(spmrkt)
  spmrkt <- st_transform(spmrkt, crs = 3414)
# Calculate nearest distance to supermarket
  nearest_spmrkt <- st_nearest_feature(resale_final, spmrkt)
  resale_final <- resale_final %>%
    mutate(proximity_to_spmrkt = st_distance(resale_final, spmrkt[nearest_spmrkt, ], by_element = TRUE))%>%
    mutate(proximity_to_spmrkt = as.numeric(proximity_to_spmrkt) / 1000)  # Convert meters to km
  • Food Amenities:

    • Data Source: dataset from Data.gov.sg, provided by the National Environment Agency (NEA).
    • Purpose: Access to diverse food options is highly valued in Singapore. Flats located near popular hawker centers are likely to have higher resale values, as they offer residents convenient access to a variety of affordable dining options, enhancing the lifestyle appeal of the location.
hawker <- st_read("data/Locational/rawdata/Hawker.geojson")
  summary(hawker)
  hawker <- st_transform(hawker, crs = 3414)
# Calculate nearest distance to hawker
  nearest_hawker <- st_nearest_feature(resale_final, hawker)
  resale_final <- resale_final %>%
    mutate(proximity_to_hawker = st_distance(resale_final, hawker[nearest_hawker, ], by_element = TRUE))%>%
    mutate(proximity_to_hawker = as.numeric(proximity_to_hawker) / 1000)  # Convert meters to km
  • Parks and Nature Reserves:

    • Data Source: dataset from Data.gov.sg, provided by the National Parks Board (NPARKS).

    • Purpose: Proximity to parks and nature reserves can increase property appeal for families and nature enthusiasts.

parks <- st_read("data/Locational/rawdata/Parks.kml") %>%
    st_transform(crs = 3414)
# Calculate nearest distance to parks and nature reserves
  nearest_parks <- st_nearest_feature(resale_final, parks)
  resale_final <- resale_final %>%
    mutate(proximity_to_parks = st_distance(resale_final, parks[nearest_parks, ], by_element = TRUE))%>%
    mutate(proximity_to_parks = as.numeric(proximity_to_parks) / 1000)  # Convert meters to km
  • Childcare Facilities:

    • Data Source: Childcare center locations from Data.gov.sg or private datasets if available.

    • Purpose: The presence of childcare facilities nearby is valuable for young families, potentially influencing their interest in the property.

childcare <- st_read("data/Locational/rawdata/Childcare.geojson") %>%
    st_transform(crs = 3414)
# Calculate nearest distance to childcare
  nearest_childcare <- st_nearest_feature(resale_final, childcare)
  resale_final <- resale_final %>%
    mutate(proximity_to_childcare = st_distance(resale_final, childcare[nearest_childcare, ], by_element = TRUE))%>%
    mutate(proximity_to_childcare = as.numeric(proximity_to_childcare) / 1000)  # Convert meters to km
#Create a buffer of 350m around each resale flat
  buffer_350m <- st_buffer(resale_final, 
                          dist = 350)
  
  #Plot the newly created buffers and each childcare
  tmap_mode("view")
  tm_shape(buffer_350m) +
    tm_polygons() +
  tm_shape(childcare) +
    tm_dots()
  
  #Count the number of childcare
  resale_final$within_350m_childcare <- lengths(
    st_intersects(buffer_350m, childcare))
  • Public Transportation Accessibility:

    • Bus Stops: Data on bus stop locations within 350m or 1km from the property, sourced from LTA’s DataMall

    • Purpose: Ease of access to public transportation increases a property’s attractiveness and can be a significant factor in resale value.

busstop <- st_read(dsn = "data/Locational/rawdata",
                       layer = "BusStop") %>%
    st_transform(crs = 3414)
# Calculate nearest distance to bus stop
  nearest_busstop <- st_nearest_feature(resale_final, busstop)
  resale_final <- resale_final %>%
    mutate(proximity_to_busstop = st_distance(resale_final, busstop[nearest_busstop, ], by_element = TRUE))%>%
    mutate(proximity_to_busstop = as.numeric(proximity_to_busstop) / 1000)  # Convert meters to km
#Plot the newly created buffers and each bus stops
  tmap_mode("view")
  tm_shape(buffer_350m) +
    tm_polygons() +
  tm_shape(busstop) +
    tm_dots()
  
  #Count the number of bus stops
  resale_final$within_350m_busstop <- lengths(
    st_intersects(buffer_350m, busstop))
  • Shopping Amenities:

    • Data Source: Location data on shopping malls can be sourced from OpenStreetMap or business directories in Singapore.
    • Purpose: Shopping malls provide access to a wide range of retail options, entertainment, and services, which enhance the appeal of nearby properties. Flats close to malls are often in higher demand due to the convenience of having various amenities within reach, potentially increasing resale values.
# Define the URL of the Wikipedia page
  url <- "https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore"
  
  # Read the HTML content of the page
  page <- read_html(url)
  
  # Extract mall names from list items across all sections
  mall <- page %>%
    html_nodes("ul > li") %>% # Selects all list items that are direct children of unordered lists
    html_text()
  
  # Filter mall names to exclude non-mall related content, if needed
  mall <- mall[mall != ""]
  
  # Remove the "[1]" notation from Knightsbridge or similar entries
  mall <- gsub("\\[1\\]", "", mall)
  
  # Extract only lines 50 to 223
  mall <- mall[50:223]
  
  # Make specific replacements in the extracted range
  mall[42] <- gsub("GRiD\\(pomo\\)", "GR.ID", mall[42])  # Change "GRiD(pomo)" to "GR.ID"
  mall[85] <- gsub("Paya Lebar Quarter \\(PLQ\\)", "Paya Lebar Quarter", mall[85])  # Change "Paya Lebar Quarter (PLQ)" to "Paya Lebar Quarter"
  
  # Print extracted and modified mall names for the specified range
  print(mall)
# Function to get coordinates from OneMap API
  get_coordinates <- function(mall_name) {
    base_url <- "https://www.onemap.gov.sg/api/common/elastic/search?"
    response <- GET(base_url, query = list(searchVal = mall_name, returnGeom = "Y", getAddrDetails = "N"))
    data <- content(response, "parsed")
    if (length(data$results) > 0) {
      result <- data$results[[1]]
      return(c(result$X, result$Y))
    } else {
      return(c(NA, NA))
    }
  }
  
  # Initialize vectors to store coordinates
  x_coords <- numeric(length(mall))
  y_coords <- numeric(length(mall))
  
  # Loop through each mall name to get coordinates
  for (i in seq_along(mall)) {
    coords <- get_coordinates(mall[i])
    x_coords[i] <- coords[1]
    y_coords[i] <- coords[2]
    Sys.sleep(1)  # Pause to respect API rate limits
  }
  
  # Combine mall names and their coordinates into a data frame
  mall_coord <- data.frame(
    Mall_Name = mall,
    longitude = x_coords,
    latitude = y_coords,
    stringsAsFactors = FALSE
  )
  
  # Print the data frame to check the results
  print(mall_coord)
# Convert to an sf object with EPSG:3414 (SVY21 coordinate system for Singapore)
  mall_coord <- mall_coord %>%
    st_as_sf(coords = c("longitude", "latitude"), crs = 3414)  # Set the original CRS (e.g., WGS84)
  
  # Print the updated mall_coord to view the geometry points in (longitude, latitude)
  print(mall_coord)
write_rds(mall_coord, "data/HDB/rds/mall_coord.rds")
mall_coord <- read_rds("data/HDB/rds/mall_coord.rds")
# Plot using tmap
  tmap_mode("view")
  tm_shape(mall_coord) +
    tm_dots(col = "blue", size = 0.1, alpha = 0.8) +
    tm_basemap("OpenStreetMap")
# Calculate nearest distance to shopping mall
  nearest_mall <- st_nearest_feature(resale_final, mall_coord)
  resale_final <- resale_final %>%
    mutate(proximity_to_mall = st_distance(resale_final, mall_coord[nearest_mall, ], by_element = TRUE))%>%
    mutate(proximity_to_mall = as.numeric(proximity_to_mall) / 1000)  # Convert meters to km
write_rds(resale_final, "data/HDB/rds/resale_final.rds")
resale_final <- read_rds("data/HDB/rds/resale_final.rds")

Each of these secondary data sources will be processed to derive proximity variables (e.g., distance to nearest MRT station, number of schools within 1km) for inclusion in the predictive model. This helps capture the spatial elements that influence property prices beyond the structural features of the flats themselves.

3.3 Spatial Autocorrelation Check

The purpose of checking for spatial autocorrelation is to determine whether HDB resale prices exhibit spatial dependency. If there is a significant spatial autocorrelation, it means that the prices of nearby HDB flats tend to be similar. This spatial dependency justifies the use of geographically weighted models, which are designed to handle location-based variations.

Moran’s I is a statistical measure that assesses the degree of spatial autocorrelation in a dataset. It ranges from -1 to +1:

  • Positive Values: Indicate positive spatial autocorrelation, where similar values (e.g., high or low resale prices) are clustered together.

  • Negative Values: Indicate negative spatial autocorrelation, where dissimilar values are adjacent.

  • Value Near Zero: Suggests spatial randomness, where there is no clear pattern in the spatial distribution.

Ensure that the coords data is in spatial format (e.g., sf object in R) with coordinates for each resale flat. We’ll need latitude and longitude coordinates for each record to analyze spatial relationships.

A spatial weight matrix defines the spatial relationships between points. We can create a weight matrix based on distance or neighborhood contiguity.

For HDB flats, a distance-based approach (e.g., nearest neighbors) is often appropriate.

When calculating k-nearest neighbors, identical points cause issues because the algorithm cannot uniquely identify the nearest neighbors if multiple points are located at the same coordinates. We can aggregate the data by averaging or summing relevant attributes to keep one unique location per point.

resale_aggregated <- resale_final %>%
    group_by(flat_type, floor_area_sqm, storey_avg, remaining_lease_total_mths, geometry) %>%
    summarize(
      resale_price = mean(resale_price, na.rm = TRUE)
    ) %>%
    ungroup()
write_rds(resale_aggregated, "data/HDB/rds/resale_aggregated.rds")
resale_aggregated <- read_rds("data/HDB/rds/resale_aggregated.rds")

Use Moran’s I to test for spatial autocorrelation of resale prices across the dataset.

A significant Moran’s I value (with a p-value < 0.05) indicates spatial autocorrelation, suggesting that geographically weighted models (e.g., GWR) are appropriate for this analysis.

Interpret Moran’s I Results:

  • Positive Moran’s I Value (with significant p-value): Indicates clustering of similar prices (e.g., high or low resale prices in specific areas), justifying the use of geographically weighted models to account for spatial variability.

  • Non-significant Moran’s I Value (or close to zero): Implies no spatial autocorrelation, suggesting that traditional models like Ordinary Least Squares (OLS) might be sufficient as there is no strong spatial dependency in the data.

# Convert geometry to spatial coordinates for neighborhood matrix
  coords <- st_coordinates(resale_aggregated)
  
  # Create a k-nearest neighbor structure
  knn <- knearneigh(coords, k = 5)  # Set k to the desired number of neighbors
  nb <- knn2nb(knn)
  weight_matrix <- nb2listw(nb, style = "W")
  
  # Calculate Moran's I
  moran_test <- moran.test(resale_aggregated$resale_price, weight_matrix)
write_rds(moran_test, "data/HDB/rds/moran_test.rds")
moran_test <- read_rds("data/HDB/rds/moran_test.rds")
# View results
  print(moran_test)

      Moran I test under randomisation
  
  data:  resale_aggregated$resale_price  
  weights: weight_matrix    
  
  Moran I statistic standard deviate = 172, p-value < 2.2e-16
  alternative hypothesis: greater
  sample estimates:
  Moran I statistic       Expectation          Variance 
       6.617663e-01     -4.329942e-05      1.480471e-05 

To interpret the results of Moran’s I test for spatial autocorrelation, let’s break down the key components of the output:

3.3.1 Moran’s I Statistic:

  • Value: approximately 0.662

  • Interpretation: Moran’s I ranges from -1 to +1. Values close to +1 indicate strong positive spatial autocorrelation, meaning similar values are clustered together in space. Values close to -1 indicate strong negative spatial autocorrelation, meaning dissimilar values are located near each other. A value of 0 suggests a random spatial pattern (no spatial autocorrelation).

  • In your case, a Moran’s I of approximately 0.662 indicates moderate to strong positive spatial autocorrelation. This suggests that higher resale prices tend to cluster with other higher resale prices, and lower prices tend to cluster with other lower prices in the spatial layout of your dataset.

3.3.2 Expectation:

  • Value: -4.329942e-05 (close to zero).

  • Interpretation: This is the expected value of Moran’s I under the null hypothesis of spatial randomness (no spatial autocorrelation). If the observed Moran’s I significantly deviates from this expectation, it suggests spatial autocorrelation.

3.3.3 Variance:

  • Value: 1.480471e-05

  • Interpretation: This is the variance of Moran’s I under the null hypothesis, used to calculate the significance of the observed statistic.

3.3.4 Standard Deviation and P-Value:

  • Standard Deviation: 171.99

  • P-Value: <2.2e-16

  • Interpretation: The very low p-value indicates that the observed Moran’s I is statistically significant, rejecting the null hypothesis of spatial randomness. This means there is strong evidence of spatial clustering in resale prices, with a pattern that is unlikely to have occurred by random chance.

3.3.5 Summary of Interpretation

  • Spatial Pattern: There is strong evidence of spatial autocorrelation in resale prices, indicating that properties with similar prices are geographically clustered.

  • Significance: The significant p-value (very close to zero) confirms that the observed spatial pattern is statistically meaningful.

  • Practical Implication: Spatial factors such as proximity to amenities, neighborhood quality, or other spatially dependent factors may be influencing resale prices. You may consider using geographically weighted regression (GWR) or other spatial models to further explore these spatial relationships.

3.3.6 Geographical Clusters of High and Low Prices:

  • Darker blue colors represent higher resale prices, while lighter yellow colors represent lower prices.

  • There appear to be clusters of higher-priced flats in certain areas, likely around central or well-connected neighborhoods, as well as newer developments. This clustering aligns with the positive spatial autocorrelation indicated by Moran’s I test, where high-price units tend to be near other high-price units, and similarly for lower-priced units.

4 Exploratory Data Analysis (EDA)

EDA is essential to understand the dataset, identify patterns, and prepare it for modeling. The steps below guide you through performing a thorough EDA, with each component addressing different aspects of the data.

4.1 Descriptive Statistics

The first step in EDA is to generate descriptive statistics, which provide a numerical summary of key variables. By calculating metrics like the mean, median, standard deviation, minimum, and maximum for resale prices, as well as structural features like the area, floor level, remaining lease, and age of the unit, we can get a sense of the central tendencies and variability in the data.

This overview is helpful for spotting potential outliers or variations that may influence resale prices. For example, high variability in resale prices might indicate a wide range of property values across different regions and flat types. Descriptive statistics also help establish a baseline understanding of the dataset, highlighting the general characteristics of HDB flats within the specified time range.

The codes chunks below uses glimpse() to display the data structure of will do the job.

glimpse(resale_final)
Rows: 23,555
  Columns: 22
  $ resale_price               <dbl> 380000, 635000, 380000, 365000, 418000, 380…
  $ town                       <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "…
  $ region                     <chr> "North-East", "North-East", "North-East", "…
  $ flat_type                  <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 …
  $ flat_model                 <chr> "New Generation", "Model A", "New Generatio…
  $ floor_area_sqm             <dbl> 67, 70, 67, 73, 73, 67, 89, 68, 75, 74, 75,…
  $ storey_avg                 <dbl> 5, 26, 8, 5, 8, 5, 8, 5, 5, 2, 2, 5, 8, 8, …
  $ remaining_lease_total_mths <dbl> 649, 1065, 649, 640, 640, 643, 673, 685, 67…
  $ geometry                   <POINT [m]> POINT (28537.68 38825.23), POINT (292…
  $ proximity_to_mrt           <dbl> 0.4046012, 0.7314224, 0.4046012, 0.5263600,…
  $ proximity_to_goodprisch    <dbl> 0.8000960, 1.0934641, 0.8000960, 1.1818905,…
  $ within_1km_prisch          <int> 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
  $ proximity_to_eldercare     <dbl> 0.40435866, 0.25495445, 0.40435866, 0.05809…
  $ proximity_to_CHAS          <dbl> 0.12666079, 0.28840296, 0.12666079, 0.06303…
  $ proximity_to_spmrkt        <dbl> 0.15775712, 0.31448012, 0.15775712, 0.06303…
  $ proximity_to_hawker        <dbl> 0.1378719, 0.3828329, 0.1378719, 0.1477741,…
  $ proximity_to_parks         <dbl> 0.19858307, 0.06656148, 0.19858307, 0.17137…
  $ proximity_to_childcare     <dbl> 1.857691e-01, 1.151054e-01, 1.857691e-01, 1…
  $ within_350m_childcare      <int> 2, 4, 2, 5, 5, 1, 2, 3, 3, 4, 5, 1, 3, 3, 3…
  $ proximity_to_busstop       <dbl> 0.16643716, 0.02802881, 0.16643716, 0.18021…
  $ within_350m_busstop        <int> 6, 4, 6, 4, 4, 5, 4, 10, 4, 5, 5, 4, 6, 10,…
  $ proximity_to_mall          <dbl> 1.0027651, 0.6647901, 1.0027651, 0.4886993,…

Next, summary() of base R is used to display the summary statistics of resale_final tibble data frame.

summary(resale_final)
  resale_price         town              region           flat_type        
   Min.   : 150000   Length:23555       Length:23555       Length:23555      
   1st Qu.: 450000   Class :character   Class :character   Class :character  
   Median : 545000   Mode  :character   Mode  :character   Mode  :character  
   Mean   : 562716                                                           
   3rd Qu.: 640000                                                           
   Max.   :1500000                                                           
    flat_model        floor_area_sqm     storey_avg    remaining_lease_total_mths
   Length:23555       Min.   : 52.00   Min.   : 2.00   Min.   : 505.0            
   Class :character   1st Qu.: 75.00   1st Qu.: 5.00   1st Qu.: 730.0            
   Mode  :character   Median : 93.00   Median : 8.00   Median : 887.0            
                      Mean   : 93.37   Mean   : 8.96   Mean   : 889.6            
                      3rd Qu.:110.00   3rd Qu.:11.00   3rd Qu.:1092.0            
                      Max.   :176.00   Max.   :50.00   Max.   :1154.0            
            geometry     proximity_to_mrt  proximity_to_goodprisch
   POINT        :23555   Min.   :0.01463   Min.   :0.04958        
   epsg:3414    :    0   1st Qu.:0.30168   1st Qu.:1.17410        
   +proj=tmer...:    0   Median :0.51494   Median :1.90283        
                         Mean   :0.58134   Mean   :2.07394        
                         3rd Qu.:0.77348   3rd Qu.:2.63467        
                         Max.   :3.49822   Max.   :7.16529        
   within_1km_prisch proximity_to_eldercare proximity_to_CHAS proximity_to_spmrkt
   Min.   :0.000     Min.   :0.0000         Min.   :0.0000    Min.   :0.0000     
   1st Qu.:0.000     1st Qu.:0.3218         1st Qu.:0.1078    1st Qu.:0.1723     
   Median :0.000     Median :0.6156         Median :0.1690    Median :0.2662     
   Mean   :0.198     Mean   :0.7870         Mean   :0.1844    Mean   :0.2880     
   3rd Qu.:0.000     3rd Qu.:1.0739         3rd Qu.:0.2420    3rd Qu.:0.3787     
   Max.   :2.000     Max.   :4.7675         Max.   :2.7122    Max.   :3.3254     
   proximity_to_hawker proximity_to_parks proximity_to_childcare
   Min.   :0.006981    Min.   :0.006039   Min.   :0.00000       
   1st Qu.:0.351676    1st Qu.:0.303246   1st Qu.:0.07462       
   Median :0.628092    Median :0.493864   Median :0.11856       
   Mean   :0.747090    Mean   :0.569264   Mean   :0.12951       
   3rd Qu.:1.007198    3rd Qu.:0.737417   3rd Qu.:0.17331       
   Max.   :2.867630    Max.   :2.066652   Max.   :2.91807       
   within_350m_childcare proximity_to_busstop within_350m_busstop
   Min.   : 0.000        Min.   :0.01543      Min.   : 0.00      
   1st Qu.: 2.000        1st Qu.:0.07423      1st Qu.: 6.00      
   Median : 3.000        Median :0.10751      Median : 8.00      
   Mean   : 3.695        Mean   :0.11455      Mean   : 7.89      
   3rd Qu.: 5.000        3rd Qu.:0.14592      3rd Qu.:10.00      
   Max.   :20.000        Max.   :0.39147      Max.   :19.00      
   proximity_to_mall
   Min.   :0.0000   
   1st Qu.:0.3810   
   Median :0.5887   
   Mean   :0.6486   
   3rd Qu.:0.8614   
   Max.   :3.1782   

4.2 Visualize Relationships

To gain insights into the relationships between resale prices and various structural and locational features, we can create scatter plots and violin plots.

Scatter plots allow us to examine how resale prices vary with continuous variables like area, age, and remaining lease. For instance, plotting resale price against area may reveal a positive trend, suggesting that larger flats generally command higher prices. Similarly, a scatter plot of resale price versus age could show whether older flats have lower resale values, providing insight into the impact of depreciation. By mapping prices geographically, we can identify areas with higher or lower resale values, giving us an understanding of spatial clustering in the data. These visualizations make it easier to observe trends and relationships that may not be apparent from summary statistics alone.

# Scatter plot: Resale price vs Area
  plot_area <- ggplot(resale_final, aes(x = floor_area_sqm, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Area", x = "Area (sqm)", y = "Resale Price")
  ggsave("plot_resale_price_vs_area.png", plot = plot_area, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Storey
  plot_storey <- ggplot(resale_final, aes(x = storey_avg, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Storey", x = "Storey", y = "Resale Price")
  ggsave("plot_resale_price_vs_storey.png", plot = plot_storey, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Age of Flat
  plot_age <- ggplot(resale_final, aes(x = remaining_lease_total_mths, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Age of Flat", x = "Age (months)", y = "Resale Price")
  ggsave("plot_resale_price_vs_age.png", plot = plot_age, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Proximity to Nearest MRT exits
  plot_mrt <- ggplot(resale_final, aes(x = proximity_to_mrt, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Proximity to Nearest MRT exits", x = "Distance (km)", y = "Resale Price")
  ggsave("plot_resale_price_vs_mrt.png", plot = plot_mrt, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Proximity to Nearest Good Primary School
  plot_goodprisch <- ggplot(resale_final, aes(x = proximity_to_goodprisch, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Proximity to Nearest Good Primary School", x = "Distance (km)", y = "Resale Price")
  ggsave("plot_resale_price_vs_goodprisch.png", plot = plot_goodprisch, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs No. of Good Primary School within 1km
  plot_prisch <- ggplot(resale_final, aes(x = within_1km_prisch, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. No. of Good Primary School within 1km", x = "No. of Good Primary Schools", y = "Resale Price")
  ggsave("plot_resale_price_vs_within_1km_prisch.png", plot = plot_prisch, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Proximity to Nearest Eldercare
  plot_eldercare <- ggplot(resale_final, aes(x = proximity_to_eldercare, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Proximity to Nearest Eldercare", x = "Distance (km)", y = "Resale Price")
  ggsave("plot_resale_price_vs_eldercare.png", plot = plot_eldercare, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Proximity to Nearest CHAS Clinic
  plot_chas <- ggplot(resale_final, aes(x = proximity_to_CHAS, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Proximity to Nearest CHAS Clinic", x = "Distance (km)", y = "Resale Price")
  ggsave("plot_resale_price_vs_chas.png", plot = plot_chas, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Proximity to Nearest Supermarket
  plot_spmrkt <- ggplot(resale_final, aes(x = proximity_to_spmrkt, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Proximity to Nearest Supermarket", x = "Distance (km)", y = "Resale Price")
  ggsave("plot_resale_price_vs_spmrkt.png", plot = plot_spmrkt, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Proximity to Nearest Hawker
  plot_hawker <- ggplot(resale_final, aes(x = proximity_to_hawker, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Proximity to Nearest Hawker", x = "Distance (km)", y = "Resale Price")
  ggsave("plot_resale_price_vs_hawker.png", plot = plot_hawker, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Proximity to Nearest Park
  plot_parks <- ggplot(resale_final, aes(x = proximity_to_parks, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Proximity to Nearest Park", x = "Distance (km)", y = "Resale Price")
  ggsave("plot_resale_price_vs_parks.png", plot = plot_parks, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs No. of Childcares within 350m
  plot_childcare <- ggplot(resale_final, aes(x = within_350m_childcare, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. No. of Childcares within 350m", x = "No. of Childcares", y = "Resale Price")
  ggsave("plot_resale_price_vs_childcare.png", plot = plot_childcare, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Proximity to Nearest Bus Stop
  plot_busstop <- ggplot(resale_final, aes(x = proximity_to_busstop, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Proximity to Nearest Bus Stop", x = "Distance (km)", y = "Resale Price")
  ggsave("plot_resale_price_vs_busstop.png", plot = plot_busstop, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs No. of Bus Stops within 350m
  plot_within_busstop <- ggplot(resale_final, aes(x = within_350m_busstop, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. No. of Bus Stops within 350m", x = "No. of Bus Stops", y = "Resale Price")
  ggsave("plot_resale_price_vs_within_busstop.png", plot = plot_within_busstop, width = 7, height = 5, dpi = 300)
  
  # Scatter plot: Resale price vs Proximity to Nearest Mall
  plot_mall <- ggplot(resale_final, aes(x = proximity_to_mall, y = resale_price)) +
    geom_point(alpha = 0.5) +
    labs(title = "Resale Price vs. Proximity to Nearest Mall", x = "Distance (km)", y = "Resale Price")
  ggsave("plot_resale_price_vs_mall.png", plot = plot_mall, width = 7, height = 5, dpi = 300)

Resale Price vs. Area (sqm): There is a strong positive relationship, with resale prices increasing as the floor area of the property increases, indicating that larger flats are valued higher.

Resale Price vs. Storey: Higher floors appear to have a slight positive effect on resale prices, possibly due to better views or reduced noise.

Resale Price vs. Age of Flat (months): Older flats show a decrease in resale price, indicating that newer flats are generally more desirable.

Resale Price vs. Proximity to Nearest MRT exits: Properties closer to MRT exits have higher resale prices, showing that accessibility to public transportation is a key factor in property valuation.

Resale Price vs. Proximity to Nearest Good Primary School: There is no strong correlation between proximity to a good primary school and resale prices, indicating limited impact on value.

Resale Price vs. No. of Good Primary Schools within 1km: The resale price seems to show minimal variation with the number of good primary schools within 1km, indicating that proximity to schools may not strongly influence pricing.

Resale Price vs. Proximity to Nearest Eldercare: Resale prices tend to decrease slightly as distance from the nearest eldercare facility increases, suggesting that closer proximity to eldercare may have a small positive impact on property value.

Resale Price vs. Proximity to Nearest CHAS Clinic: There is a weak negative correlation, where resale prices tend to be slightly higher when closer to CHAS clinics, though the effect appears minimal.

Resale Price vs. Proximity to Nearest Supermarket: Resale prices appear higher when closer to supermarkets, implying that proximity to essential amenities like supermarkets positively impacts housing prices.

Resale Price vs. Proximity to Nearest Hawker Center: Resale prices decrease with distance from hawker centers, suggesting that closeness to affordable food options is a desirable attribute.

Resale Price vs. Proximity to Nearest Park: Proximity to parks shows some clustering of higher resale prices near parks, but the relationship is not very strong, as there is a wide spread across various distances.

Resale Price vs. Number of Childcares within 350m: The number of nearby childcares shows a more scattered trend, with resale prices higher around lower numbers of nearby childcare facilities. This may suggest that too many nearby childcares don’t significantly enhance property value.

Resale Price vs. Proximity to Nearest Bus Stop: There is a very dense clustering of data points close to bus stops, with no strong indication of increased resale price with proximity to bus stops. The resale prices are distributed widely regardless of proximity.

Resale Price vs. Number of Bus Stops within 350m: The resale price does not show a clear trend relative to the number of nearby bus stops. Although resale prices are generally higher for areas with fewer bus stops within 350m, the spread remains wide across different bus stop counts, suggesting that proximity to multiple bus stops within a short distance may not strongly influence resale price.

Resale Price vs. Proximity to Nearest Mall: Resale prices tend to be slightly higher when closer to malls, indicating that proximity to malls may have a mild positive influence on prices. However, there’s still a significant spread across all distances.

A violin plot is also an excellent choice for visualizing the distribution of resale prices across different regions and flat models because it combines elements of both a box plot and a kernel density plot. Here are specific reasons why a violin plot is suitable:

  1. Distribution Insight: Violin plots provide a clear view of the distribution’s shape for each category, revealing where resale prices are concentrated and if there are multiple modes (peaks) in the data.

  2. Comparison Across Categories: By plotting different regions and flat models side-by-side, we can easily compare the price distributions across categories and detect variations in spread, skewness, or the presence of outliers.

  3. Density Information: Unlike box plots, violin plots show the density of the data at different price levels. This is helpful in understanding if resale prices cluster around certain values or if they are more uniformly distributed.

We’ll analyse 4-room flats specifically based on their popularity and representativeness in Singapore’s housing market. Four-room flats are among the most common flat types in Singapore’s HDB system, making them a relevant indicator of general market trends. By focusing on this category, we can gain insights into typical resale prices across regions while comparing different flat models. This narrower focus allows for more meaningful and consistent comparisons without the variability introduced by other flat types, which may have different price dynamics due to their size and demand characteristics.

# Filter for 4-room flats only
  filtered_data <- resale_tidy %>%
    filter(flat_type == "4 ROOM")  # Filtering for 4-room flats
  
  # Create the violin plot with facets for each region, using color for flat model
  plot <- ggplot(filtered_data, aes(x = "", y = resale_price, fill = flat_model)) +
    geom_violin(trim = FALSE) +
    facet_wrap(~ region, scales = "free", ncol = 2) +  # Arrange facets in two columns
    labs(title = "Resale Prices of 4-Room Flats by Flat Model in Each Region",
         x = "Region", y = "Resale Price") +
    scale_y_continuous(labels = scales::dollar_format()) +
    theme_minimal() +
    theme(axis.text.x = element_blank(),  # Remove x-axis text for readability
          axis.title.x = element_blank()) +
    guides(fill = guide_legend(title = "Flat Model"))
  
  # Save the plot as an image file
  ggsave("resale_prices_violin_plot.png", plot = plot, width = 10, height = 8)

The violin plot above provides insights into the distribution of resale prices for 4-room flats across various regions in Singapore, segmented by flat model. Here are some key observations:

  1. Regional Price Variability:

    • Central Region: The resale prices in the Central region have the highest variability, with several flat models (such as Premium Apartment, DBSS) reaching close to $1,500,000. This reflects the higher demand and premium associated with centrally located flats in Singapore.

    • North-East Region: Prices here are generally clustered within a narrower range between $500,000 and $800,000, with DBSS and Premium Apartment models showing higher resale values.

    • East Region: Similar to the North-East, the East region shows moderate resale values between $500,000 and $800,000, with DBSS flats on the higher end.

    • North and West Regions: These regions display the lowest price variability, with most models staying between $400,000 and $800,000. In the North, flat models like DBSS and Premium Apartment tend to show slightly higher resale values, though not as high as those in the Central or East regions.

  2. Flat Model Influence on Prices:

    • DBSS (Design, Build, and Sell Scheme) and Premium Apartment models generally fetch higher resale prices across all regions. This is likely due to the enhanced design, additional amenities, and higher quality associated with these models.

    • Standard and Simplified models tend to have lower resale prices, indicating they are more basic options with fewer amenities, and thus appeal to buyers looking for affordability over additional features.

    • New Generation and Model A are popular across regions but usually fall into a mid-range price point, indicating a balanced offering of quality and affordability.

  3. Distribution and Density:

    • Each violin plot represents the density of resale prices for a given flat model within a region. Wider areas in the violin plot indicate a higher concentration of resale transactions at a specific price point.

    • For example, in the Central region, Premium Apartment flats have a wide spread, suggesting a high concentration of resale prices around the median. However, in the North-East region, DBSS flats have a narrower spread, indicating that resale prices are more concentrated within a smaller range.

  4. Insights by Region:

    • Central Region: Dominated by higher-value flats, with some models displaying very high resale prices due to the prime location.

    • East and North-East Regions: These regions offer a good mix of mid-to-high resale values, particularly for DBSS and Premium Apartment flats.

    • North and West Regions: These are more affordable regions, with flat models generally priced lower compared to the Central area.

In summary, the Central region commands the highest prices, likely due to its proximity to the city center, while the other regions offer a range of mid- to high-priced flats, depending on the model. DBSS and Premium Apartments consistently have higher resale prices across all regions, indicating their premium value in the market. This plot helps identify patterns in resale prices by region and flat model, which could be valuable for buyers, investors, and policymakers.

4.3 Spatial Distribution Analysis

A spatial distribution analysis allows us to understand how resale prices are geographically distributed across different regions of Singapore. By converting the dataset into a spatial format (e.g., sf object with latitude and longitude coordinates), we can create choropleth maps to show variations in resale prices. Using color gradients to represent different price levels, we can visually identify high-value and low-value clusters across Singapore.

This approach highlights specific regions where resale prices are consistently higher or lower, such as areas near central business districts or regions with greater access to amenities. Spatial distribution analysis is particularly useful for understanding location-based trends and identifying regions that may warrant further investigation or targeted modeling techniques.

# Visualize spatial distribution of resale prices
  tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(resale_final) +
    tm_dots(col = "resale_price", palette = "YlOrRd", title = "HDB Resale Prices") +
    tm_layout(title = "Spatial Distribution of HDB Resale Prices")

4.4 Correlation Analysis

Correlation analysis is an essential step for identifying which variables have the strongest influence on resale prices. By selecting relevant numerical columns, including resale price, area, age, remaining lease, and floor level, we can compute pairwise correlations and focus on relationships with resale price. The resulting correlation matrix provides a quick overview of how each feature is associated with resale price, revealing strong positive or negative correlations.

Visualizing this matrix with a heatmap helps to quickly identify which factors are most influential. For example, a high positive correlation between area and resale price would indicate that larger flats tend to sell for more, while a negative correlation with age may suggest that older flats depreciate in value. This analysis informs which features are most relevant for predictive modeling, allowing us to focus on those that have a meaningful impact on resale prices.

Based on the observations from the violin plot, we can select one region and one flat model. For example, we could choose the “Central” region and “DBSS” model if we’re interested in high-value areas and premium flats, or the “West” region with “Standard” flats for a more affordable segment.

# Filtering for Central region and DBSS flat model
  central_DBSS <- resale_final %>%
    filter(flat_type == "4 ROOM", region == "Central", flat_model == "DBSS")
  
  # Filtering for West region and Simplified flat model
  West_Simplified <- resale_final %>%
    filter(flat_type == "4 ROOM", region == "West", flat_model == "Simplified")
# Remove geometry and select specific columns
  central_DBSS_nogeo <- central_DBSS %>%
    st_drop_geometry() %>%
    select(-c(2:5))
  
  # Generate the correlation matrix plot
  correlation_plot <- ggcorrmat(central_DBSS_nogeo[, 2:16])
  
  # Save the correlation plot as an image file
  ggsave("central_DBSS_correlation_matrix.png", plot = correlation_plot, width = 10, height = 8)

The correlation matrix for the “Central” region and “DBSS” model (high-value areas) shows the relationships between various attributes, such as proximity to amenities, storey level, floor area, and remaining lease. Here are some observations:

  1. Strong Positive Correlations:

    • Proximity to public transportation: There is a strong positive correlation between proximity_to_busstop and within_350m_busstop (0.96) and between proximity_to_childcare and within_350m_childcare (0.96). This reflects that flats near these amenities are also generally close to other amenities in high-density, central areas.

    • Between amenities: proximity_to_sprmkt (supermarket) and proximity_to_hawker have a very high correlation (0.98), as do proximity_to_parks and proximity_to_hawker (0.91). This suggests that these amenities are often located near each other, increasing the overall desirability of these locations.

  2. Negative Correlations:

    • Remaining Lease and Proximity to Amenities: There is a notable negative correlation between remaining_lease_total_mths and proximity_to_sprmkt (-0.75), proximity_to_hawker (-0.81), and proximity_to_parks (-0.67). This suggests that older flats (with shorter remaining leases) are located closer to these amenities, possibly due to urban planning changes over time that placed newer buildings farther from the center.

    • Storey Level (Average) and Proximity to Amenities: storey_avg shows negative correlations with amenities like proximity_to_sprmkt (-0.75) and proximity_to_parks (-0.67). This could indicate that higher floors are generally located in areas farther from these amenities, which might align with the planning for more luxurious high-rise buildings away from the central amenities.

  3. Non-significant Relationships:

    • The matrix indicates non-significant correlations with an “X” mark, particularly in relationships with storey_avg and some proximity metrics. For instance, storey_avg and proximity_to_goodprisch (good primary schools) have an insignificant correlation. This suggests that the vertical positioning (floor level) of flats doesn’t meaningfully impact proximity to certain amenities, like schools.
  4. Implications for Resale Price Analysis:

    • High positive correlations among amenity proximities imply that these factors could be clustered, affecting prices similarly. Since older properties are closer to key amenities (shown by the negative correlation with remaining_lease_total_mths), they may attract different buyer demographics compared to newer properties.

    • The lack of significant correlation between storey_avg and proximity variables for certain amenities may imply that floor level does not influence proximity benefits for high-value DBSS flats in the Central region.

  5. Recommendations for Modeling:

    • Feature Selection: Given the high correlations among proximity variables, selecting a few representative amenities rather than all could prevent multicollinearity in models predicting resale prices.

    • Cluster Analysis: Grouping high-correlated amenities could be useful for understanding how proximity affects high-value property prices.

    • Focus on Age-Related Factors: Since remaining lease is strongly correlated with proximity to amenities, it may act as a proxy for central vs. peripheral location in older properties.

# Remove geometry and select specific columns
  West_Simplified_nogeo <- West_Simplified %>%
    st_drop_geometry() %>%
    select(-c(2:5))
  
  # Generate the correlation matrix plot
  correlation_plot <- ggcorrmat(West_Simplified_nogeo[, 2:16])
  
  # Save the correlation plot as an image file
  ggsave("West_Simplified_correlation_matrix.png", plot = correlation_plot, width = 10, height = 8)

The correlation matrix for the “West” region with “Standard” flats (an affordable segment) highlights the relationships between different factors influencing resale prices. Here’s a breakdown:

  1. Positive Correlations Among Amenities:

    • There is a strong positive correlation between proximity variables such as proximity_to_busstop and within_350m_busstop (0.87), proximity_to_childcare and within_350m_childcare (0.72), and proximity_to_goodprisch and within_1km_prisch (0.74). This indicates that these amenities are clustered together, making specific areas more accessible and potentially more desirable.

    • Additionally, some proximity variables like proximity_to_spmrkt (supermarket) and proximity_to_hawker show a moderately positive correlation, suggesting that affordable regions have grouped amenities in certain neighborhoods, creating small commercial hubs.

  2. Negative Correlations with Remaining Lease:

    • There is a moderately negative correlation between remaining_lease_total_mths and several proximity variables, such as proximity_to_hawker (-0.44) and proximity_to_parks (-0.38). This suggests that older flats with shorter leases tend to be closer to amenities, reflecting urban planning trends where older properties are located in more established areas with nearby amenities.
  3. Storey Level and Proximity to Amenities:

    • The average storey level (storey_avg) shows weak correlations with most proximity variables, meaning that in the affordable segment, the vertical positioning (higher or lower floors) of flats is not strongly related to the distance to amenities. This could suggest that height is not a major factor in price determination within this segment in the West region.
  4. Non-significant Relationships:

    • There are several non-significant correlations (marked with “X”) across the matrix, particularly involving storey_avg and various amenity proximities. This lack of significant relationships implies that factors like floor level do not have a clear impact on the accessibility of amenities in affordable areas.
  5. Implications for Resale Price Analysis:

    • The clustering of amenities and the moderate negative correlations with remaining lease suggest that proximity to amenities is more likely in older flats. This factor might be more relevant for affordability, as amenities can increase the desirability of older flats with shorter leases.

    • The overall weak correlation of storey_avg and the proximity variables points to a relatively uniform impact of accessibility, regardless of floor level, indicating that buyers in this segment may prioritize horizontal (location) rather than vertical accessibility.

  6. Recommendations for Further Analysis:

    • Modeling Strategy: Since proximity variables are clustered, one could consider aggregating them or using principal component analysis (PCA) to reduce dimensionality in models predicting resale prices.

    • Focus on Lease and Proximity: Given the moderate relationship between lease length and amenity proximity, models could investigate how remaining lease length and accessibility to clustered amenities jointly affect property desirability.

Overall, EDA provides a comprehensive understanding of the dataset through descriptive statistics, relationship visualizations, spatial distribution analysis, and correlation analysis. Each step contributes valuable insights: descriptive statistics offer a general overview, visualizations highlight trends and patterns, spatial analysis reveals geographical variations, and correlation analysis identifies the most influential factors. Together, these insights form the foundation for developing an effective predictive model for HDB resale prices.

5 Building a Hedonic Pricing Model by using Multiple Linear Regression Method

To predict HDB resale prices effectively, we will start by establishing a multiple linear regression model. Next, we will explore geographically weighted models to capture spatial heterogeneity. This section outlines the key steps for building and evaluating these models.

5.1 Multiple Linear Regression Method

The multiple linear regression model serves as a starting point for understanding the relationship between resale prices and various structural and locational factors. It assumes that relationships between predictors (like area, age, proximity to amenities) and resale price are consistent across all geographic locations.

We’ll be converting flat_model into a factor will automatically treat each category as a separate level in the model without needing one-hot encoding.

# Convert `flat_type`, `region` and `flat_model` to factors
  resale_final <- resale_final %>%
    mutate(flat_type = as.factor(flat_type),
           region = as.factor(region),
           flat_model = as.factor(flat_model))

The code chunk below using lm() to calibrate the multiple linear regression model.

resale_mlr <- lm(formula = resale_price ~ 
                   floor_area_sqm + 
                   storey_avg + 
                   remaining_lease_total_mths + 
                   proximity_to_mrt + 
                   proximity_to_goodprisch + 
                   within_1km_prisch + 
                   proximity_to_eldercare + 
                   proximity_to_CHAS + 
                   proximity_to_spmrkt + 
                   proximity_to_hawker + 
                   proximity_to_parks + 
                   proximity_to_childcare + 
                   within_350m_childcare + 
                   proximity_to_busstop + 
                   within_350m_busstop + 
                   proximity_to_mall + 
                   flat_type + 
                   region + 
                   flat_model,  # Assuming 'flat_type', 'region'                  , and 'flat_model' are factors
                   data = resale_final)
write_rds(resale_mlr, "data/HDB/rds/resale_mlr.rds")
resale_mlr <- read_rds("data/HDB/rds/resale_mlr.rds")

5.2 Preparing Publication Quality Table: olsrr method

With reference to the report above, it is clear that not all the independent variables are statistically significant. We will revised the model by removing those variables which are not statistically significant.

Now, we are ready to calibrate the revised model by using the code chunk below.

ols_regress(resale_mlr)
                              Model Summary                                
  --------------------------------------------------------------------------
  R                           0.916       RMSE                    64519.442 
  R-Squared                   0.839       MSE                4169484825.877 
  Adj. R-Squared              0.839       Coef. Var                  11.475 
  Pred R-Squared              0.838       AIC                    588654.342 
  MAE                     47856.606       SBC                    588968.959 
  --------------------------------------------------------------------------
   RMSE: Root Mean Square Error 
   MSE: Mean Square Error 
   MAE: Mean Absolute Error 
   AIC: Akaike Information Criteria 
   SBC: Schwarz Bayesian Criteria 
  
                                       ANOVA                                       
  --------------------------------------------------------------------------------
                      Sum of                                                      
                     Squares           DF       Mean Square       F          Sig. 
  --------------------------------------------------------------------------------
  Regression    5.102031e+14           37      1.378927e+13    3307.189    0.0000 
  Residual      9.805377e+13        23517    4169484825.877                       
  Total         6.082569e+14        23554                                         
  --------------------------------------------------------------------------------
  
                                                       Parameter Estimates                                                       
  ------------------------------------------------------------------------------------------------------------------------------
                             model           Beta    Std. Error    Std. Beta       t         Sig           lower          upper 
  ------------------------------------------------------------------------------------------------------------------------------
                       (Intercept)    -212949.177     16847.847                  -12.640    0.000    -245972.049    -179926.305 
                    floor_area_sqm       5180.848        90.734        0.614      57.100    0.000       5003.004       5358.691 
                        storey_avg       5839.961        76.681        0.221      76.159    0.000       5689.661       5990.260 
        remaining_lease_total_mths        443.262         4.076        0.506     108.755    0.000        435.273        451.251 
                  proximity_to_mrt     -24106.048      1255.627       -0.055     -19.198    0.000     -26567.159     -21644.938 
           proximity_to_goodprisch      -6768.217       497.990       -0.054     -13.591    0.000      -7744.309      -5792.125 
                 within_1km_prisch       7761.181      1316.564        0.020       5.895    0.000       5180.630      10341.732 
            proximity_to_eldercare      -8762.624       767.591       -0.034     -11.416    0.000     -10267.152      -7258.096 
                 proximity_to_CHAS       4755.351      4397.039        0.003       1.081    0.279      -3863.132      13373.833 
               proximity_to_spmrkt     -16311.321      2928.008       -0.017      -5.571    0.000     -22050.407     -10572.236 
               proximity_to_hawker     -26553.249       988.416       -0.085     -26.864    0.000     -28490.607     -24615.890 
                proximity_to_parks     -26393.337      1287.289       -0.062     -20.503    0.000     -28916.507     -23870.166 
            proximity_to_childcare      10947.090      5155.185        0.006       2.124    0.034        842.594      21051.586 
             within_350m_childcare      -2681.812       243.666       -0.033     -11.006    0.000      -3159.413      -2204.210 
              proximity_to_busstop      22861.952      8273.740        0.008       2.763    0.006       6644.885      39079.020 
               within_350m_busstop        406.545       169.873        0.007       2.393    0.017         73.583        739.507 
                 proximity_to_mall      -9397.458      1407.183       -0.021      -6.678    0.000     -12155.628      -6639.287 
                   flat_type4 ROOM       3494.984      2647.682        0.011       1.320    0.187      -1694.645       8684.613 
                   flat_type5 ROOM      13686.820      5222.222        0.037       2.621    0.009       3450.927      23922.714 
                        regionEast    -107451.956      1723.780       -0.230     -62.335    0.000    -110830.678    -104073.235 
                       regionNorth    -193077.423      1848.693       -0.468    -104.440    0.000    -196700.982    -189453.864 
                  regionNorth-East    -156387.441      1668.560       -0.433     -93.726    0.000    -159657.926    -153116.955 
                        regionWest    -157204.041      1668.027       -0.409     -94.246    0.000    -160473.482    -153934.601 
           flat_modelAdjoined flat     105692.643     18885.149        0.022       5.597    0.000      68676.526     142708.760 
                    flat_modelDBSS     191297.003     14661.522        0.133      13.048    0.000     162559.470     220034.537 
                flat_modelImproved      33241.758     14189.024        0.090       2.343    0.019       5430.350      61053.165 
     flat_modelImproved-Maisonette     176551.841     32212.926        0.016       5.481    0.000     113412.417     239691.265 
                 flat_modelModel A      39401.864     14203.221        0.121       2.774    0.006      11562.629      67241.098 
      flat_modelModel A-Maisonette     149003.432     17590.334        0.038       8.471    0.000     114525.236     183481.628 
                flat_modelModel A2      44323.883     14751.886        0.030       3.005    0.003      15409.230      73238.536 
          flat_modelNew Generation      91881.327     14276.932        0.191       6.436    0.000      63897.614     119865.040 
       flat_modelPremium Apartment      36270.736     14226.036        0.069       2.550    0.011       8386.784      64154.689 
  flat_modelPremium Apartment Loft     289691.450     24927.185        0.037      11.622    0.000     240832.552     338550.349 
              flat_modelSimplified     102307.694     14430.332        0.125       7.090    0.000      74023.306     130592.082 
                flat_modelStandard      31229.077     14538.977        0.030       2.148    0.032       2731.738      59726.416 
                 flat_modelTerrace     406349.619     28478.697        0.044      14.269    0.000     350529.526     462169.711 
                 flat_modelType S1     351164.338     17838.349        0.087      19.686    0.000     316200.018     386128.659 
                 flat_modelType S2     458875.210     24164.316        0.062      18.990    0.000     411511.584     506238.836 
  ------------------------------------------------------------------------------------------------------------------------------

5.3 Preparing Publication Quality Table: gtsummary method

The gtsummary package provides an elegant and flexible way to create publication-ready summary tables in R.

In the code chunk below, tbl_regression() is used to create a well formatted regression report.

tbl_regression(resale_mlr, intercept = TRUE)
Characteristic Beta 95% CI1 p-value
(Intercept) -212,949 -245,972, -179,926 <0.001
floor_area_sqm 5,181 5,003, 5,359 <0.001
storey_avg 5,840 5,690, 5,990 <0.001
remaining_lease_total_mths 443 435, 451 <0.001
proximity_to_mrt -24,106 -26,567, -21,645 <0.001
proximity_to_goodprisch -6,768 -7,744, -5,792 <0.001
within_1km_prisch 7,761 5,181, 10,342 <0.001
proximity_to_eldercare -8,763 -10,267, -7,258 <0.001
proximity_to_CHAS 4,755 -3,863, 13,374 0.3
proximity_to_spmrkt -16,311 -22,050, -10,572 <0.001
proximity_to_hawker -26,553 -28,491, -24,616 <0.001
proximity_to_parks -26,393 -28,917, -23,870 <0.001
proximity_to_childcare 10,947 843, 21,052 0.034
within_350m_childcare -2,682 -3,159, -2,204 <0.001
proximity_to_busstop 22,862 6,645, 39,079 0.006
within_350m_busstop 407 74, 740 0.017
proximity_to_mall -9,397 -12,156, -6,639 <0.001
flat_type


    3 ROOM — —
    4 ROOM 3,495 -1,695, 8,685 0.2
    5 ROOM 13,687 3,451, 23,923 0.009
region


    Central — —
    East -107,452 -110,831, -104,073 <0.001
    North -193,077 -196,701, -189,454 <0.001
    North-East -156,387 -159,658, -153,117 <0.001
    West -157,204 -160,473, -153,935 <0.001
flat_model


    3Gen — —
    Adjoined flat 105,693 68,677, 142,709 <0.001
    DBSS 191,297 162,559, 220,035 <0.001
    Improved 33,242 5,430, 61,053 0.019
    Improved-Maisonette 176,552 113,412, 239,691 <0.001
    Model A 39,402 11,563, 67,241 0.006
    Model A-Maisonette 149,003 114,525, 183,482 <0.001
    Model A2 44,324 15,409, 73,239 0.003
    New Generation 91,881 63,898, 119,865 <0.001
    Premium Apartment 36,271 8,387, 64,155 0.011
    Premium Apartment Loft 289,691 240,833, 338,550 <0.001
    Simplified 102,308 74,023, 130,592 <0.001
    Standard 31,229 2,732, 59,726 0.032
    Terrace 406,350 350,530, 462,170 <0.001
    Type S1 351,164 316,200, 386,129 <0.001
    Type S2 458,875 411,512, 506,239 <0.001
1 CI = Confidence Interval

5.4 Key Insights:

Several key insights about the fitted linear regression model:

  1. Overall Model Fit:

    • The model shows a strong fit, with an R-squared of 0.839 and an Adjusted R-squared of 0.8385, indicating that around 83.9% of the variance in resale prices is explained by the independent variables included in the model. This is a high R-squared value, suggesting the model is well-specified for predicting resale prices based on the available features.

    • The Residual Standard Error of 64,570 suggests that predictions deviate by about this much on average, which may indicate room for further refinement, particularly in capturing nuances in higher-priced or unique units.

  2. Significance of Variables:

    • Most predictors are statistically significant (p < 0.05), as evidenced by the low p-values in the table. However, the expanded table includes 95% Confidence Intervals (CI) for each coefficient, allowing for an assessment of the precision of each estimate.

    • Floor Area, Storey Average, and Remaining Lease are all highly significant predictors with narrow confidence intervals, confirming that they are reliable indicators of resale price.

  3. Effect of Regions:

    • Using the Central region as a baseline, we see that all other regions have negative coefficients. Specifically:

      • East: Resale prices are approximately SGD 107,452 lower than in the Central region, with a narrow confidence interval that confirms the precision of this estimate.

      • North: Shows an even larger negative effect on resale prices, about SGD 193,077 lower than Central, suggesting that geographical proximity to Singapore’s urban core has a significant effect on property values.

      • North-East and West: These regions also display a significant price discount relative to Central, likely due to accessibility and perceived amenities.

    • This strong regional variation emphasizes the importance of location in resale pricing, with Central commanding the highest premium.

  4. Effect of Flat Model:

    • The model uses 3Gen as the baseline flat model. Different flat models show substantial variation in pricing. For example:

      • Type S2 flats have a very high positive effect, with an estimated increase of approximately SGD 458,875 over the baseline, highlighting this model’s desirability or premium status.

      • Other high-value models like DBSS (+SGD 191,297) and Improved-Maisonette (+SGD 176,552) indicate that certain layouts or schemes, possibly due to exclusivity or enhanced facilities, contribute significantly to price.

      • In contrast, Standard flats have a modest positive impact of around SGD 31,229, showing that standard flat types are valued less compared to other premium models.

    • The inclusion of these models as categorical variables helps capture the inherent structural and aesthetic differences among units.

  5. Effect of Proximity to Amenities:

    • Proximity measures to amenities like MRT stations, bus stops, and parks have notable effects on resale prices. For instance:

      • Proximity to MRT (-SGD 24,106): Being closer to an MRT station reduces the resale price, suggesting that noise or congestion associated with MRT stations could be perceived negatively by buyers.

      • Within 1km of Primary Schools (+SGD 7,761): This positively affects price, highlighting the appeal of accessible schooling for families.

      • Proximity to Parks (-SGD 26,393): Despite parks being a positive amenity, the negative coefficient suggests that there might be other factors at play, such as over-saturation in certain areas.

    • The expanded model’s confidence intervals reinforce the interpretation of these proximity variables, with most intervals not overlapping zero, which confirms their significance.

  6. Effects of Flat Types (3 ROOM, 4 ROOM, 5 ROOM):

    • The model treats 3 ROOM flats as the baseline. Compared to 3 ROOM flats:

      • 4 ROOM flats show a slight positive association with resale price, but the effect size (approximately SGD 3,495) is not statistically significant (p = 0.2), suggesting it does not differ much from the baseline in value.

      • 5 ROOM flats have a larger, statistically significant positive effect (+SGD 13,687), indicating higher demand or perceived value for these larger units.

  7. Interpretation of Confidence Intervals:

    • The inclusion of 95% confidence intervals for each predictor provides insight into the reliability of each estimate. For example:

      • For Floor Area, the CI ranges from SGD 5,003 to SGD 5,359 per sqm, showing a tight interval that suggests high reliability.

      • For categorical variables, such as regions and flat models, the confidence intervals further confirm the significant differences among categories, especially for high-impact variables like DBSS and Type S2 models.

  8. Additional Model Diagnostics:

    • RMSE (Root Mean Square Error) of 64,519 provides a sense of the average magnitude of residuals in the original resale price units. Given the price range, this suggests reasonable model accuracy but indicates room for fine-tuning.

    • AIC and SBC values, also shown in the model summary, are used for model comparison. Lower values indicate better model performance in terms of fit and complexity. These metrics might be useful if comparing against alternative models or specifications.

Summary

This model successfully captures significant predictors of resale price, with clear regional and flat model effects, as well as a nuanced understanding of how proximity to amenities affects valuation. The high R-squared value and statistically significant predictors underscore the robustness of this model.

6 Building Random Forest Model

6.1 Data Sampling

The entire data are split into training and test data sets with 65% and 35% respectively by using initial_split() of rsample package. rsample is one of the package of tigymodels.

set.seed(1234)
            resale_split <- initial_split(resale_final, 
                                          prop = 6.5/10,)
            train_data <- training(resale_split)
            test_data <- testing(resale_split)
write_rds(train_data, "data/HDB/rds/train_data.rds")
            write_rds(test_data, "data/HDB/rds/test_data.rds")
train_data <- read_rds("data/HDB/rds/train_data.rds")
            test_data <- read_rds("data/HDB/rds/test_data.rds")

6.2 Preparing coordinates data

6.2.1 Extracting coordinates data

The code chunk below extract the x,y coordinates of the full, training and test data sets.

coords <- st_coordinates(resale_final)
            coords_train <- st_coordinates(train_data)
            coords_test <- st_coordinates(test_data)

Before continue, we write all the output into rds for future used.

coords_train <- write_rds(coords_train, "data/HDB/rds/coords_train.rds" )
            coords_test <- write_rds(coords_test, "data/HDB/rds/coords_test.rds" )
coords_train <- read_rds("data/HDB/rds/coords_train.rds")
            coords_test <- read_rds("data/HDB/rds/coords_test.rds")

6.2.2 Dropping geometry field

First, we will drop geometry column of the sf data.frame by using st_drop_geometry() of sf package.

train_data <- train_data %>% 
              st_drop_geometry()

6.3 Calibrating Random Forest Model

We’ll calibrate a model to predict HDB resale price by using random forest function of ranger package.

set.seed(1234)
            # Convert columns to factors if they are not already
            train_data$flat_type <- as.factor(train_data$flat_type)
            train_data$region <- as.factor(train_data$region)
            train_data$flat_model <- as.factor(train_data$flat_model)
            
            # Run the random forest model
            set.seed(1234)
            rf <- ranger(formula = resale_price ~ 
                             floor_area_sqm + 
                             storey_avg + 
                             remaining_lease_total_mths + 
                             proximity_to_mrt + 
                             proximity_to_goodprisch + 
                             within_1km_prisch + 
                             proximity_to_eldercare + 
                             proximity_to_CHAS + 
                             proximity_to_spmrkt + 
                             proximity_to_hawker + 
                             proximity_to_parks + 
                             proximity_to_childcare + 
                             within_350m_childcare + 
                             proximity_to_busstop + 
                             within_350m_busstop + 
                             proximity_to_mall + 
                             flat_type + 
                             region + 
                             flat_model,  # Assuming 'flat_type', 'region'                  , and 'flat_model' are factors
                             data = train_data
            )
write_rds(rf, "data/HDB/rds/rf.rds")
rf <- read_rds("data/HDB/rds/rf.rds")
# Check the model output
            rf
function (n, df1, df2, ncp) 
            {
                if (missing(ncp)) 
                    .Call(C_rf, n, df1, df2)
                else (rchisq(n, df1, ncp = ncp)/df1)/(rchisq(n, df2)/df2)
            }
            <bytecode: 0x0000022fa31b0be0>
            <environment: namespace:stats>

6.4 Key Insights:

  1. Call: This shows the formula and data used for the model. It confirms that resale_price is the dependent variable, and the other variables in the formula are predictors.

  2. Type: The model is a regression model, meaning it predicts a continuous outcome (resale price) based on the input features.

  3. Number of Trees: The model used 500 trees, which is typical for a random forest model to ensure stability and robustness in predictions.

  4. Sample Size: The model trained on 15,310 data points. This is the total number of observations in the training dataset after any preprocessing.

  5. Number of Independent Variables: There are 19 predictor variables used in the model. These include both numerical features (e.g., floor_area_sqm, remaining_lease_total_mths) and categorical features (e.g., flat_type, region, flat_model).

  6. Mtry: This is the number of variables randomly sampled as candidates at each split, set to 4 in this case. In regression, it’s generally recommended to set mtry to around the square root of the number of predictors, which the model has done here.

  7. Target Node Size: This is the minimum number of data points in a node before a split is attempted. A target node size of 5 indicates that the trees were allowed to grow fairly deep, capturing more nuances in the data.

  8. Variable Importance Mode: The model did not calculate variable importance metrics in this run (none). To get insights into which variables contribute most to the model, you could rerun the model with variable importance set to impurity or permutation.

  9. Out-of-Bag (OOB) Prediction Error (MSE): The mean squared error (MSE) for the model based on out-of-bag samples (i.e., samples not used in each tree’s training) is about 135,242,786.4. This metric indicates the average squared difference between the predicted and actual values. A lower MSE indicates a better model fit. However, given that the actual resale prices could vary widely, the raw MSE value alone is less intuitive than its square root, which would give the average error in the same units as resale_price.

  10. Out-of-Bag R-Squared (OOB R²): The OOB R² of 0.9481 indicates that approximately 94.8% of the variance in resale prices can be explained by this model. This is a high R² value, suggesting that the model fits the data well and that the predictors capture a significant portion of the variability in the resale prices.

Summary

The model explains a large portion of the variability in resale prices (OOB R² of 94.8%), which suggests that the selected predictors are well-suited for this task. However, the MSE suggests some degree of error in individual predictions, which is typical with high-dimensional and complex datasets. If you want to understand which variables contribute most to the predictions, consider recalculating the model with variable importance metrics enabled. This would allow you to identify the predictors with the most significant impact on resale prices and refine the model if needed.

7 Calibrating gwr predictive method

We’ll calibrate a model to predict HDB resale price by using geographically weighted regression method of GWmodel package.

Due to time constraint, we’ll select “West” region with “Standard” flats (an affordable segment) for our prediction model.

7.1 Data Sampling

The entire data are split into training and test data sets with 65% and 35% respectively by using initial_split() of rsample package. rsample is one of the package of tigymodels.

set.seed(1234)
            W_split <- initial_split(West_Simplified, 
                                          prop = 6.5/10,)
            Wtrain_data <- training(W_split)
            Wtest_data <- testing(W_split)
write_rds(Wtrain_data, "data/HDB/rds/Wtrain_data.rds")
            write_rds(Wtest_data, "data/HDB/rds/Wtest_data.rds")
Wtrain_data <- read_rds("data/HDB/rds/Wtrain_data.rds")
            Wtest_data <- read_rds("data/HDB/rds/Wtest_data.rds")

7.2 Converting the sf data.frame to SpatialPointDataFrame

train_data_sp <- as_Spatial(Wtrain_data)
            train_data_sp
class       : SpatialPointsDataFrame 
            features    : 53 
            extent      : 15874.38, 21512.98, 33696.28, 40886.84  (xmin, xmax, ymin, ymax)
            crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
            variables   : 21
            names       : resale_price,        town, region, flat_type, flat_model, floor_area_sqm, storey_avg, remaining_lease_total_mths,  proximity_to_mrt, proximity_to_goodprisch, within_1km_prisch, proximity_to_eldercare,    proximity_to_CHAS, proximity_to_spmrkt, proximity_to_hawker, ... 
            min values  :       393000, Bukit Batok,   West,    4 ROOM, Simplified,             83,          2,                        729, 0.109513324694062,       0.733705947485525,                 0,       0.14154563839911, 1.63444770069886e-08,  0.0398263831270178,   0.191639179549987, ... 
            max values  :       480000, Jurong West,   West,    4 ROOM, Simplified,             89,         11,                        779,  1.39680251695794,         5.3465481705808,                 1,       2.41846152812648,    0.492061964279255,   0.622852294951817,    1.97345230639643, ... 

7.3 Computing adaptive bandwidth

Next, bw.gwr() of GWmodel package will be used to determine the optimal bandwidth to be used.

The code chunk below is used to determine adaptive bandwidth and CV method is used to determine the optimal bandwidth.

set.seed(1234)
            bw_adaptive <- bw.gwr(formula = resale_price ~ 
                             floor_area_sqm + 
                             storey_avg + 
                             remaining_lease_total_mths + 
                             proximity_to_mrt + 
                             proximity_to_goodprisch + 
                             within_1km_prisch + 
                             proximity_to_eldercare + 
                             proximity_to_CHAS + 
                             proximity_to_spmrkt + 
                             proximity_to_hawker + 
                             proximity_to_parks + 
                             proximity_to_childcare + 
                             within_350m_childcare + 
                             proximity_to_busstop + 
                             within_350m_busstop + 
                             proximity_to_mall,
                             data=train_data_sp,
                             approach="CV",
                             kernel="gaussian",
                             adaptive=TRUE,
                             longlat=FALSE)

This output shows the cross-validation (CV) scores associated with different adaptive bandwidth values in a Geographically Weighted Regression (GWR) or a similar spatial model. Here’s how to interpret it:

  1. Adaptive Bandwidth Values: The bandwidth here represents the number of nearest neighbors considered in the localized models. Smaller bandwidths mean the model is more sensitive to local variations, while larger bandwidths make the model more global.

  2. CV Score: The CV score measures how well each bandwidth performs in cross-validation, with a lower score indicating better model performance. The score helps to balance model complexity and prediction accuracy, where lower values typically indicate a better fit to the data.

  3. Selecting the Optimal Bandwidth: From the list, you would typically select the bandwidth with the lowest CV score as the optimal choice. In this case, the bandwidth with a CV score of 20226641242 (appearing twice at bandwidths of 40) is the lowest observed. This suggests that a bandwidth of 40 may be the best option for balancing model accuracy and generalization.

In summary, based on this output, you would likely choose an adaptive bandwidth of 40, as it yields the lowest CV score and therefore the best cross-validated performance for your model.

Let’s save the model output by using the code chunk below.

write_rds(bw_adaptive, "data/HDB/rds/bw_adaptive.rds")

The code chunk below can be used to retrieve the save model in future.

bw_adaptive <- read_rds("data/HDB/rds/bw_adaptive.rds")

7.4 Constructing the adaptive bandwidth gwr model

Now, we can go ahead to calibrate the gwr-based hedonic pricing model by using adaptive bandwidth and Gaussian kernel as shown in the code chunk below.

gwr_adaptive <- gwr.basic(formula = resale_price ~ 
                             floor_area_sqm + 
                             storey_avg + 
                             remaining_lease_total_mths + 
                             proximity_to_mrt + 
                             proximity_to_goodprisch + 
                             within_1km_prisch + 
                             proximity_to_eldercare + 
                             proximity_to_CHAS + 
                             proximity_to_spmrkt + 
                             proximity_to_hawker + 
                             proximity_to_parks + 
                             proximity_to_childcare + 
                             within_350m_childcare + 
                             proximity_to_busstop + 
                             within_350m_busstop + 
                             proximity_to_mall,
                                      data=train_data_sp,
                                      bw=bw_adaptive, 
                                      kernel = 'gaussian', 
                                      adaptive=TRUE,
                                      longlat = FALSE)

The code chunk below will be used to save the model in rds format for future use.

write_rds(gwr_adaptive, "data/HDB/rds/gwr_adaptive.rds")

7.5 Retrieve gwr output object

The code chunk below will be used to retrieve the save gwr model object.

gwr_adaptive <- read_rds("data/HDB/rds/gwr_adaptive.rds")

The code below can be used to display the model output.

gwr_adaptive
   ***********************************************************************
               *                       Package   GWmodel                             *
               ***********************************************************************
               Program starts at: 2024-11-11 12:35:28.28084 
               Call:
               gwr.basic(formula = resale_price ~ floor_area_sqm + storey_avg + 
                remaining_lease_total_mths + proximity_to_mrt + proximity_to_goodprisch + 
                within_1km_prisch + proximity_to_eldercare + proximity_to_CHAS + 
                proximity_to_spmrkt + proximity_to_hawker + proximity_to_parks + 
                proximity_to_childcare + within_350m_childcare + proximity_to_busstop + 
                within_350m_busstop + proximity_to_mall, data = train_data_sp, 
                bw = bw_adaptive, kernel = "gaussian", adaptive = TRUE, longlat = FALSE)
            
               Dependent (y) variable:  resale_price
               Independent variables:  floor_area_sqm storey_avg remaining_lease_total_mths proximity_to_mrt proximity_to_goodprisch within_1km_prisch proximity_to_eldercare proximity_to_CHAS proximity_to_spmrkt proximity_to_hawker proximity_to_parks proximity_to_childcare within_350m_childcare proximity_to_busstop within_350m_busstop proximity_to_mall
               Number of data points: 53
               ***********************************************************************
               *                    Results of Global Regression                     *
               ***********************************************************************
            
               Call:
                lm(formula = formula, data = data)
            
               Residuals:
               Min     1Q Median     3Q    Max 
            -27218  -8073   1380   9822  26935 
            
               Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)  
               (Intercept)                209005.3   325770.2   0.642   0.5252  
               floor_area_sqm               3870.5     2328.8   1.662   0.1052  
               storey_avg                   1555.7      862.5   1.804   0.0797 .
               remaining_lease_total_mths   -171.1      356.8  -0.479   0.6345  
               proximity_to_mrt           -14587.2    26510.8  -0.550   0.5856  
               proximity_to_goodprisch     -2693.8    15598.5  -0.173   0.8639  
               within_1km_prisch            4097.2    15770.7   0.260   0.7965  
               proximity_to_eldercare       4951.2    19685.9   0.252   0.8029  
               proximity_to_CHAS          -15844.3    35057.6  -0.452   0.6540  
               proximity_to_spmrkt         -3048.0    34614.6  -0.088   0.9303  
               proximity_to_hawker         20348.1    15866.5   1.282   0.2079  
               proximity_to_parks          -9263.6    26084.5  -0.355   0.7246  
               proximity_to_childcare      -7379.6    51792.4  -0.142   0.8875  
               within_350m_childcare       -3804.8     3838.0  -0.991   0.3281  
               proximity_to_busstop        67798.6    59328.9   1.143   0.2607  
               within_350m_busstop          2185.8     1822.9   1.199   0.2383  
               proximity_to_mall           12992.6    35002.4   0.371   0.7127  
            
               ---Significance stars
               Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
               Residual standard error: 15640 on 36 degrees of freedom
               Multiple R-squared: 0.4686
               Adjusted R-squared: 0.2324 
               F-statistic: 1.984 on 16 and 36 DF,  p-value: 0.04377 
               ***Extra Diagnostic information
               Residual sum of squares: 8809680695
               Sigma(hat): 13143.01
               AIC:  1189.635
               AICc:  1209.753
               BIC:  1243.566
               ***********************************************************************
               *          Results of Geographically Weighted Regression              *
               ***********************************************************************
            
               *********************Model calibration information*********************
               Kernel function: gaussian 
               Adaptive bandwidth: 40 (number of nearest neighbours)
               Regression points: the same locations as observations are used.
               Distance metric: Euclidean distance metric is used.
            
               ****************Summary of GWR coefficient estimates:******************
                                                Min.    1st Qu.     Median    3rd Qu.
               Intercept                  103276.200 116917.878 125143.179 132045.986
               floor_area_sqm               3460.170   3909.214   4015.135   5258.983
               storey_avg                   1301.608   1377.038   1603.782   1646.507
               remaining_lease_total_mths   -305.807   -279.591   -109.236    -84.887
               proximity_to_mrt           -17544.314 -17128.293 -16636.057  -7304.776
               proximity_to_goodprisch     -3829.180  -2700.017  -2382.457   3204.890
               within_1km_prisch           -6157.607  -3737.040   4618.721   5608.667
               proximity_to_eldercare      -3544.639  -1753.441   3912.421   4406.509
               proximity_to_CHAS          -23946.423 -16486.447 -13718.926   4714.188
               proximity_to_spmrkt         -1459.413   2042.040   3747.888   4883.473
               proximity_to_hawker         17433.803  18666.208  20417.845  21040.997
               proximity_to_parks         -12673.527  -7283.971  -5557.734   1667.461
               proximity_to_childcare     -28332.788 -17460.966 -14701.844  32381.501
               within_350m_childcare       -5270.263  -3880.349  -3530.989    204.700
               proximity_to_busstop        60283.800  61431.791  79340.573  82077.784
               within_350m_busstop          1429.394   1528.051   2318.671   2371.214
               proximity_to_mall           12078.655  19120.249  20133.757  28929.507
                                               Max.
               Intercept                  218245.94
               floor_area_sqm               5539.89
               storey_avg                   1682.73
               remaining_lease_total_mths    -78.42
               proximity_to_mrt            -4834.40
               proximity_to_goodprisch      4408.98
               within_1km_prisch            7601.88
               proximity_to_eldercare       6276.63
               proximity_to_CHAS            6366.82
               proximity_to_spmrkt          6567.95
               proximity_to_hawker         21647.15
               proximity_to_parks           2647.07
               proximity_to_childcare      37758.05
               within_350m_childcare         767.48
               proximity_to_busstop        83640.86
               within_350m_busstop          2568.17
               proximity_to_mall           32001.73
               ************************Diagnostic information*************************
               Number of data points: 53 
               Effective number of parameters (2trace(S) - trace(S'S)): 21.21632 
               Effective degrees of freedom (n-2trace(S) + trace(S'S)): 31.78368 
               AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 1215.556 
               AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 1166.083 
               BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 1170.997 
               Residual sum of squares: 7712357500 
               R-square value:  0.5348005 
               Adjusted R-square value:  0.2141817 
            
               ***********************************************************************
               Program stops at: 2024-11-11 12:35:28.291716 

7.6 Converting the test data from sf data.frame to SpatialPointDataFrame

test_data_sp <- Wtest_data %>%
              as_Spatial()
            test_data_sp
class       : SpatialPointsDataFrame 
            features    : 29 
            extent      : 16206.33, 21512.98, 33696.28, 40886.84  (xmin, xmax, ymin, ymax)
            crs         : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
            variables   : 21
            names       : resale_price,        town, region, flat_type, flat_model, floor_area_sqm, storey_avg, remaining_lease_total_mths, proximity_to_mrt, proximity_to_goodprisch, within_1km_prisch, proximity_to_eldercare,    proximity_to_CHAS,  proximity_to_spmrkt, proximity_to_hawker, ... 
            min values  :       395000, Bukit Batok,   West,    4 ROOM, Simplified,             83,          2,                        733,  0.1168428466505,        1.06080581347727,                 0,      0.224055609298717, 1.94724087782615e-08, 7.13026183898598e-07,   0.176832997150221, ... 
            max values  :       495000, Jurong West,   West,    4 ROOM, Simplified,             89,         11,                        777, 1.39680251695794,         5.3465481705808,                 0,       2.41846152812648,    0.479837420413484,     0.61067387833907,    2.08744917834828, ... 

7.7 Computing predicted values of the test data

gwr_pred <- gwr.predict(formula = resale_price ~ 
                             floor_area_sqm + 
                             storey_avg + 
                             remaining_lease_total_mths + 
                             proximity_to_mrt + 
                             proximity_to_goodprisch + 
                             within_1km_prisch + 
                             proximity_to_eldercare + 
                             proximity_to_CHAS + 
                             proximity_to_spmrkt + 
                             proximity_to_hawker + 
                             proximity_to_parks + 
                             proximity_to_childcare + 
                             within_350m_childcare + 
                             proximity_to_busstop + 
                             within_350m_busstop + 
                             proximity_to_mall,
                                    data=train_data_sp, 
                                    predictdata = test_data_sp, 
                                    bw=40, 
                                    kernel = 'gaussian', 
                                    adaptive=TRUE, 
                                    longlat = FALSE)

The code chunk below will be used to save the model in rds format for future use.

write_rds(gwr_pred, "data/HDB/rds/gwr_pred.rds")

The code chunk below will be used to retrieve the save gwr model object.

gwr_pred <- read_rds("data/HDB/rds/gwr_pred.rds")

The code below can be used to display the model output.

gwr_pred
   ***********************************************************************
               *                       Package   GWmodel                             *
               ***********************************************************************
               Program starts at: 2024-11-11 12:52:33.414486 
               Call:
               gwr.predict(formula = resale_price ~ floor_area_sqm + storey_avg + 
                remaining_lease_total_mths + proximity_to_mrt + proximity_to_goodprisch + 
                within_1km_prisch + proximity_to_eldercare + proximity_to_CHAS + 
                proximity_to_spmrkt + proximity_to_hawker + proximity_to_parks + 
                proximity_to_childcare + within_350m_childcare + proximity_to_busstop + 
                within_350m_busstop + proximity_to_mall, data = train_data_sp, 
                predictdata = test_data_sp, bw = 40, kernel = "gaussian", 
                adaptive = TRUE, longlat = FALSE)
            
               Dependent (y) variable for prediction:  resale_price
               Independent variables:  floor_area_sqm storey_avg remaining_lease_total_mths proximity_to_mrt proximity_to_goodprisch within_1km_prisch proximity_to_eldercare proximity_to_CHAS proximity_to_spmrkt proximity_to_hawker proximity_to_parks proximity_to_childcare within_350m_childcare proximity_to_busstop within_350m_busstop proximity_to_mall
               Number of data points: 53
               ***********************************************************************
               *     Results of Geographically Weighted Regression for prediction    *
               ***********************************************************************
            
               *********************Model calibration information*********************
               Kernel function: gaussian 
               Adaptive bandwidth: 40 (number of nearest neighbours)
               Distance metric: Euclidean distance metric is used.
            
               ****************Summary of GWR coefficient estimates:******************
                                                     Min.    1st Qu.     Median    3rd Qu.
               Intercept_coef                   99921.875 106077.652 116574.187 132045.986
               floor_area_sqm_coef               3469.918   4027.749   4100.722   5095.451
               storey_avg_coef                   1308.402   1412.061   1608.985   1664.558
               remaining_lease_total_mths_coef   -303.040   -261.827   -101.184    -84.887
               proximity_to_mrt_coef           -17474.129 -17083.877 -16703.385  -9053.199
               proximity_to_goodprisch_coef     -3807.324  -2382.457  -2068.831   2232.594
               within_1km_prisch_coef           -5869.894  -2452.481   4342.919   5167.799
               proximity_to_eldercare_coef      -3312.198   -585.822   3730.485   4108.243
               proximity_to_CHAS_coef          -23914.949 -15799.882 -11973.248   2984.952
               proximity_to_spmrkt_coef         -1459.413   2764.910   3472.396   6292.418
               proximity_to_hawker_coef         17611.884  19414.352  20485.968  21427.543
               proximity_to_parks_coef         -12665.033  -6996.662  -4522.502    612.946
               proximity_to_childcare_coef     -28332.788 -14735.485 -10825.224  28915.454
               within_350m_childcare_coef       -5256.943  -3685.762  -3104.551   -290.356
               proximity_to_busstop_coef        60219.392  61557.090  79743.184  82612.147
               within_350m_busstop_coef          1443.124   1603.663   2249.168   2330.550
               proximity_to_mall_coef           12152.795  19532.661  21506.143  26254.662
                                                     Max.
               Intercept_coef                  218226.957
               floor_area_sqm_coef               5513.990
               storey_avg_coef                   1682.731
               remaining_lease_total_mths_coef    -79.083
               proximity_to_mrt_coef            -5194.747
               proximity_to_goodprisch_coef      4240.686
               within_1km_prisch_coef            7601.878
               proximity_to_eldercare_coef       6233.049
               proximity_to_CHAS_coef            6059.762
               proximity_to_spmrkt_coef          6849.090
               proximity_to_hawker_coef         21647.147
               proximity_to_parks_coef           2172.947
               proximity_to_childcare_coef      37315.573
               within_350m_childcare_coef         681.334
               proximity_to_busstop_coef        83464.783
               within_350m_busstop_coef          2568.166
               proximity_to_mall_coef           31227.829
            
               ****************       Results of GW prediction       ******************
                                   Min.   1st Qu.    Median   3rd Qu.      Max.
               prediction        413477    423653    438323    447092    479050
               prediction_var 305213908 332954204 356909371 401502262 664804873
            
               ***********************************************************************
               Program stops at: 2024-11-11 12:52:33.44649 

Each row under the summary represents a predictor variable, with corresponding statistics for the estimated coefficients across locations:

  • Min, 1st Qu., Median, 3rd Qu., Max: These values show the distribution of coefficient estimates for each predictor variable across all locations. A wide range (difference between Min and Max) suggests substantial spatial variation in how a predictor influences resale_price in different areas.

3.7.1 Interpretations:

  • floor_area_sqm_coef: The coefficient for floor_area_sqm has a range from approximately 4,027 to 5,513, indicating that each square meter of floor area affects resale price differently across locations. Higher coefficients in some areas imply that additional floor space increases resale price more significantly there.

  • proximity_to_mrt_coef: The coefficient ranges from -17,474 to -1,608, meaning proximity to MRT (mass rapid transit) stations generally decreases resale price (negative impact). The wide range suggests that this impact is more substantial in some areas than others.

  • proximity_to_hawker_coef: The coefficient for proximity to hawker centers has a range from approximately -28,355 to -14,625. This consistent negative impact across locations could suggest that proximity to hawker centers is generally not seen as favorable for resale prices, but the extent varies geographically.

  • proximity_to_busstop_coef: The coefficients for proximity to bus stops have positive values ranging from 50.3 to 2,568. This indicates that being close to bus stops might have a small positive effect on resale prices, with stronger effects in certain locations.

Generally, a wide range of coefficient values for a predictor implies significant spatial variation, meaning the predictor’s influence on resale_price depends heavily on location.

3.7.2 Results of GWR Prediction

  • Predicted Resale Prices:

    • Min: 305,213

    • 1st Quartile: 332,925

    • Median: 435,699

    • 3rd Quartile: 498,332

    • Max: 664,408

    These values represent the distribution of the predicted resale prices across all locations based on the GWR model. They give an idea of the range of resale prices in the dataset after considering spatial effects.

7.8 Key Takeaways

  • Spatial Variation: The GWR model reveals that the impact of predictors on resale_price is not uniform across locations. This is particularly important for variables like floor_area_sqm, proximity_to_mrt, and proximity_to_hawker, which show significant variability in coefficients.

  • Local Effects: GWR captures local effects better than a global model (like ordinary least squares) by adjusting the coefficients for each location, providing insights into how property characteristics affect resale prices differently across regions.

  • Interpretation for Decision-Making: Real estate developers, urban planners, or policymakers could use this information to understand the varying influences of proximity to amenities or property features on housing prices and plan accordingly for different neighborhoods.

This output suggests that the GWR model successfully captures the local variations in resale_price, making it a suitable approach for analyzing spatially heterogeneous data like property prices.